Preface


It has now been eleven years since the publication of the first edition of Robust 
Statistics: Theory and Methods in 2006. Since that time, there have been two 
developments prompting the need for a second edition. The first development is that 
since 2006 a number of new results in the theory and methods of robust statistics 
have been developed and published, in particular by the book’s authors. The second 
development is that the S-PLUS software has been superseded by the open source
package R, so our original S-PLUS robust statistics package became outdated.
Thus, for this second edition, we have created a new R-based package called 
RobStatTM, and in that package and at the publisher’s web site we provide scripts 
for computing all the examples in the book. 

We will now discuss the main research advances included in this second edition. 


Finite-sample robustness 


Asymptotically normal robust estimators have tuning constants that allow users to 
control their normal distribution variance efficiency, in a trade-off with robustness 
toward fat-tailed non-normal distributions. The resulting finite-sample performance 
in terms of mean-squared error (MSE), which takes into account bias as well as vari- 
ance, can be considerably worse than implied by the asymptotic performance. This 
second edition contains useful new results concerning the finite-sample MSE perfor- 
mance of robust linear regression and robust covariance estimators. These are briefly 
described below. 


Linear regression estimators 


A loss function with optimality properties is introduced in Section 5.8.1, and it is 
shown that its use gives much better results than the popular bisquare function, in 
both efficiency and robustness. 
Section 5.9.3 focuses on finite-sample efficiency and robustness and introduces
a new “distance-constrained maximum-likelihood” (DCML) estimator. The DCML 
estimator is shown to provide the best trade-off between finite-sample robustness and 
normal distribution efficiency, in comparison with an MM estimator that is asymp- 
totically 85% efficient, and an adaptive estimator, described in Section 5.9.2, that is 
asymptotically fully efficient for normal distributions. 


Multivariate location and scatter 


A number of proposed robust covariance matrix estimators were discussed in the first 
edition, and some comments about the choice of estimator were made. In this second 
edition, the new Section 6.10 “Choosing a location/scatter estimator” replaces the 
previous Section 6.8, and this new section provides new recommendations for 
choosing a robust covariance matrix estimator, based on extensive finite-sample 
performance simulation studies. 


Fast and reliable starting points for initial estimators 


The standard starting point for computing initial S-estimators for linear regression 
and covariance matrix estimators is based on a subsampling algorithm. Subsampling 
algorithms have two disadvantages: the first is that their computation time increases 
exponentially with the number of variables. The second disadvantage of the sub- 
sampling method is that the method is stochastic, which means that different final 
S-estimators and MM-estimators can occur when the computation is repeated. 


Linear regression 


Section 5.7.4 describes a deterministic algorithm due to Peña and Yohai (1999) for
obtaining a starting point for robust regression. Since this algorithm is deterministic,
it always yields the same final MM-estimator. This is particularly important
in some applications, for example in financial risk calculations. Furthermore, it is
shown in Section 5.7.6 that the Peña–Yohai starting-value algorithm is much faster
than the subsampling method, and has smaller maximum MSE than the subsampling
algorithm, sometimes substantially so.


Multivariate location and scatter 


Subsampling methods have also been used to get starting values for robust estimators 
of location and dispersion (scatter), but they have a similar difficulty as in linear 
regression, namely that they will be too slow when the number of variables is large. 
Fortunately, there is an improved algorithm for computing starting values due to Peña
and Prieto (2007), which makes use of finding projection directions of maximum and
minimum kurtosis plus a set of random directions obtained by a “stratified sampling”
procedure. This method, which is referred to as the KSD method, is described in
Section 6.9.2. While the KSD method is still stochastic in nature, it provides fast,
reliable starting values, and is more stable than ordinary subsampling, as is discussed
in Sections 6.10.2 and 6.10.3.


Robust regularized regression 


The use of penalized regression estimators to obtain good results for high-dimensional 
but sparse predictor variables has been a hot topic in the “machine learning” literature 
over the last decade or so. These estimators add L1 and L2 penalties to the least
squares objective function; the leading estimators of this type are Lasso regression, 
Least Angle Regression, and Elastic Net regression, among others. A new section on 
robust regularized regression describes how to extend robust linear model regression 
to obtain robust versions of the above non-robust least-squares-based regularized 
regression estimators. 


Multivariate location and scatter estimation 
with missing data 


Section 6.12 provides a method for solving the problem of robust estimation of scatter 
and location with missing data. The method contains two main components. The first 
is the introduction of a generalized S-estimator of scatter and location that depends 
on Mahalanobis distances for the non-missing data in each observation. The second 
component is a weighted version of the well-known expectation-maximization (EM) 
algorithm for missing data. 


Robust estimation with independent outliers in variables 


The Tukey–Huber outlier-generating family of distribution models has been a commonly
accepted standard model for robust statistics research and associated empirical
studies for independent and identically distributed data. In the case of multivariate
data, the Tukey–Huber model describes the distribution of the rows, or “cases”, of a
data matrix whose columns represent variables; outliers generated by this model are
known as “case outliers”. However, there are important problems where outliers occur
independently across cells (that is, across variables) in each row of a data matrix.
For example, with portfolios of stock returns, where the columns represent different
stocks and the rows represent observations at different times, outlier returns in different
stocks (representing idiosyncratic risk) occur independently across stocks; that
is, across cells/variables.
Section 6.13 discusses an important and relatively recent model for generating
independent outliers across cells (across variables), called the independent contamination
(IC) model. It turns out that estimators that have good robustness properties
under the Tukey–Huber model have very poor robustness properties for the IC model.
For example, estimators that have high breakdown points under the Tukey–Huber
model can have very low breakdown points under the IC model. This section surveys
the current state of research on robust methods for IC models, and on robust methods
for simultaneously dealing with outliers from both Tukey–Huber and IC models. The
problem of obtaining robust estimators that work well for both Tukey–Huber and IC
models is an important ongoing area of research.
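
A small calculation suggests why case-wise breakdown points lose their meaning under the IC model. The sketch below assumes, purely for illustration, that each cell in a row of p variables is contaminated independently with probability eps; the function name and the chosen values of eps and p are hypothetical and are not taken from Section 6.13. Under that assumption, the probability that a row contains at least one contaminated cell is 1 - (1 - eps)^p.

```python
def contaminated_row_fraction(eps: float, p: int) -> float:
    """Probability that a row of p cells has at least one contaminated cell,
    assuming each cell is contaminated independently with probability eps."""
    return 1.0 - (1.0 - eps) ** p


if __name__ == "__main__":
    eps = 0.05  # illustrative cell-wise contamination probability
    for p in (1, 5, 10, 14, 20, 50):
        frac = contaminated_row_fraction(eps, p)
        print(f"p = {p:2d}: expected fraction of affected rows = {frac:.3f}")
```

Under these assumptions, with eps = 0.05 the expected fraction of affected rows exceeds one half once p reaches 14, which is more than any estimator designed for case-wise contamination can tolerate.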


Mixed linear models 


Section 6.15 discusses robust methods for mixed linear models. Two primary methods 
are discussed, the first of which is an S-estimator method that has good robustness 
properties for Tukey–Huber model case-wise outliers, but does not perform well for
cell-wise independent outliers. The second method is designed to do well for both
types of outliers, and achieves a breakdown point of 50% for Tukey–Huber models
and 29% for IC models.


Generalized linear models 


New material on a family of robust estimators has been added to the chapter on gen- 
eralized linear models (GLMs). These estimators are based on using M-estimators 
after a variance-stabilizing transformation has been applied to the response variable. 


Regularized robust estimators of the inverse covariance matrix


In Chapter 6, on multivariate analysis, Section 6.14 looks at regularizing robust
estimators of inverse covariance matrices in situations where the ratio of the number of
variables to the number of cases is close to or larger than one.


A note on software and book web site


The section on “Recommendations and software” at the end of each chapter indicates
the procedures recommended by the authors and the R functions that implement
them. These functions are located in several libraries, in particular the R package
RobStatTM, which was especially developed for this book. All are available in the
CRAN network (https://cran.r-project.org). 

The R scripts and datasets that enable the reader to reproduce the book’s examples 
are available at the book’s web site at www.wiley.com/go/maronna/robust where 
each dataset has the same name as the respective script. The scripts and data sets 
are also directly available in RobStatTM. The book web site also contains an errata 
document. 


Preface to the First Edition 


Why robust statistics are needed 


All statistical methods rely explicitly or implicitly on a number of assumptions. 
These assumptions generally aim at formalizing what the statistician knows or 
conjectures about the data analysis or statistical modeling problem he or she is faced 
with, while at the same time aim at making the resulting model manageable from 
the theoretical and computational points of view. However it is generally understood 
that the resulting formal models are simplifications of reality and that their validity 
is at best approximate. The most widely used model formalization is the assumption 
that the observed data has a normal (Gaussian) distribution. This assumption has 
been present in statistics for two centuries, and has been the framework for all the 
classical methods in regression, analysis of variance, and multivariate analysis. 
There have been attempts to justify the assumption of normality with theoretical 
arguments, such as the central limit theorem. These attempts however are easily 
proven wrong. The main justification for assuming a normal distribution is that it 
gives an approximate representation to many real data sets, and at the same time 
is theoretically quite convenient because it allows one to derive explicit formulae 
for optimal statistical methods such as maximum likelihood and likelihood ratio 
tests, as well as the sampling distribution of inference quantities such as t-statistics. 
We refer to such methods as classical statistical methods, and note that they rely 
on the assumption that normality holds exactly. The classical statistics are by 
modern computing standards quite easy to compute. Unfortunately theoretical and 
computational convenience does not always deliver an adequate tool for the practice 
of statistics and data analysis, as we shall see throughout this book. 

It often happens in practice that an assumed normal distribution model (e.g., a 
location model or a linear regression model with normal errors) holds approximately 
in that it describes the majority of observations, but some observations follow a different
pattern or no pattern at all. In the case when the randomness in the model is
assigned to observational errors (as in astronomy, which was the first instance of
the use of the least squares method), the reality is that while the behavior of many
sets of data appeared rather normal, this held only approximately, with the main discrepancy
being that a small proportion of observations were quite atypical by virtue
of being far from the bulk of the data. Behavior of this type is common across the 
entire spectrum of data analysis and statistical modeling applications. Such atypical 
data are called outliers, and even a single outlier can have a large distorting influence 
on a classical statistical method that is optimal under the assumption of normality or 
linearity. The kind of “approximately” normal distribution that gives rise to outliers 
is one that has a normal shape in the central region, but has tails that are heavier or 
“fatter” than those of a normal distribution.

One might naively expect that if such approximate normality holds, then the 
results of using a normal distribution theory would also hold approximately. This 
is unfortunately not the case. If the data are assumed to be normally distributed but 
their actual distribution has heavy tails, then estimates based on the maximum likeli- 
hood principle not only cease to be “best” but may have unacceptably low statistical 
efficiency (unnecessarily large variance) if the tails are symmetric and may have very 
large bias if the tails are asymmetric. Furthermore, for the classical tests their level 
may be quite unreliable and their power quite low, and for the classical confidence 
intervals their confidence level may be quite unreliable and their expected confidence 
interval length may be quite large. 

The robust approach to statistical modeling and data analysis aims at deriving 
methods that produce reliable parameter estimates and associated tests and confi- 
dence intervals, not only when the data follow a given distribution exactly, but also 
when this happens only approximately in the sense just described. While the empha- 
sis of this book is on approximately normal distributions, the approach works as well 
for other distributions that are close to a nominal model, e.g., approximate gamma 
distributions for asymmetric data. A more informal data-oriented characterization of 
robust methods is that they fit the bulk of the data well: if the data contain no outliers 
the robust method gives approximately the same results as the classical method, while 
if a small proportion of outliers are present the robust method gives approximately the 
same results as the classical method applied to the “typical” data. As a consequence 
of fitting the bulk of the data well, robust methods provide a very reliable method of 
detecting outliers, even in high-dimensional multivariate situations. 

We note that one approach to dealing with outliers is the diagnostic approach. 
Diagnostics are statistics generally based on classical estimates, that aim at giving 
numerical or graphical clues for the detection of data departures from the assumed 
model. There is a considerable literature on outlier diagnostics, and a good outlier 
diagnostic is clearly better than doing nothing. However, these methods present two 
drawbacks. One is that they are in general not as reliable for detecting outliers as 
examining departures from a robust fit to the data. The other is that, once suspicious 
observations have been flagged, the actions to be taken with them remain the analyst’s 
personal decision, and thus there is no objective way to establish the properties of the 
result of the overall procedure. 
Robust methods have a long history that can be traced back at least to the end of
the 19th century with Simon Newcomb (see Stigler, 1973). But its first great steps for- 
ward occurred in the 60s and the early 70s with the fundamental work of John Tukey 
(1960, 1962), Peter Huber (1964, 1967) and Frank Hampel (1971, 1974). The appli- 
cability of the new robust methods proposed by these researchers was made possible 
by the increased speed and accessibility of computers. In the last four decades the 
field of robust statistics has experienced substantial growth as a research area, as evi- 
denced by a large number of published articles. Influential books have been written 
by Huber (1981), Hampel et al. (1986), Rousseeuw and Leroy (1987) and Staudte and 
Sheather (1990). The research efforts of the current book’s authors, many of which are 
reflected in the various chapters, were stimulated by the early foundation results, as 
well as work by many other contributors to the field, and the emerging computational 
opportunities for delivering robust methods to users. 

The above body of work has begun to have some impact outside the domain of 
robustness specialists, and there appears to be a generally increased awareness of 
the dangers posed by atypical data values and of the unreliability of exact model 
assumptions. Outlier detection methods are nowadays discussed in many textbooks 
on classical statistical methods, and implemented in several software packages. Fur- 
thermore by now several commercial statistical software packages offer some robust 
methods, with the Robust Library offering in S-PLUS being the currently most com- 
plete and user friendly. In spite of the increased awareness of the impact outliers 
can have on classical statistical methods and the availability of some commercial 
software, robust methods remain largely unused and even unknown by most of the 
communities of applied statisticians, data analysts, and scientists that might benefit 
from their use. It is our hope that this book will help rectify this unfortunate situation. 


Purpose of the book 


This book was written to stimulate the routine use of robust methods as a powerful 
tool to increase the reliability and accuracy of statistical modeling and data analysis. 
To quote John Tukey (1975a), who used the terms robust and resistant somewhat 
interchangeably: 

It is perfectly proper to use both classical and robust/resistant methods routinely,
and only worry when they differ enough to matter. But when they differ, you should
think hard.


For each statistical model such as location, scale, linear regression, etc., there exist 
several if not many robust methods, and each method has several variants which an 
applied statistician, scientist or data analyst must choose from. To select the most 
appropriate method for each model it is important to understand how the robust meth- 
ods work, and their pros and cons. The book aims at enabling the reader to select and 
use the most adequate robust method for each model, and at the same time to understand
the theory behind the method; i.e., not only the “how” but also the “why”. Thus
for each of the models treated in this book we provide: 


• conceptual and statistical theory explanations of the main issues;

• the leading methods proposed to date and their motivations;

• a comparison of the properties of the methods;

• computational algorithms, and S-PLUS implementations of the different
approaches;

• recommendations of preferred robust methods, based on what we take to be reasonable
trade-offs between estimator theoretical justification and performance, transparency
to users, and computational costs.


Intended audience 


The intended audience of this book consists of the following groups of individuals 
among the broad spectrum of data analysts, applied statisticians and scientists: 


• those who will be quite willing to apply robust methods to their problems once
they are aware of the methods, supporting theory and software implementations;

• instructors who want to teach a graduate level course on robust statistics;

• graduate students wishing to learn about robust statistics;

• graduate students and faculty who wish to pursue research on robust statistics and
will use the book as background study.


General prerequisites are basic courses in probability, calculus and linear alge- 
bra, statistics and familiarity with linear regression at the level of Weisberg (1985), 
Montgomery, Peck and Vining (2001), and Seber and Lee (2003). Previous knowl- 
edge of multivariate analysis, generalized linear models and time series are required 
for Chapters 6, 7 and 8, respectively. 


Organization of the Book 


There are many different approaches for each model in robustness, resulting in a huge 
volume of research and applications publications (though perhaps shorter on the latter 
than we might like). Doing justice to all of them would require an encyclopedic work 
that would not necessarily be very effective for our goal. Instead we concentrate on 
the methods we consider most sound according to our knowledge and experience. 
Chapter 1 is a data-oriented motivation chapter. Chapter 2 introduces the main
methods in the context of location and scale estimation; in particular we concen- 
trate on the so-called M-estimates that will play a major role throughout the book. 
Chapter 3 discusses methods for the evaluation of the robustness of model parameter
estimates, and derives “optimal” estimates based on robustness criteria. Chapter 4
deals with linear regression for the case where the predictors contain no outliers, typi- 
cally because they are fixed non-random values, including for example fixed balanced 
designs. Chapter 5 treats linear regression with general random predictors which may 
contain outliers in the form of so-called “leverage” points. Chapter 6 treats robust 
estimation of multivariate location and dispersion, and robust principal components. 
Chapter 7 deals with logistic regression and generalized linear models. Chapter 8 
deals with robust estimation of time series models, with a main focus on AR and 
ARIMA. Chapter 9 contains a more detailed treatment of the iterative algorithms for 
the numerical computing of M-estimates. Chapter 10 develops the asymptotic the- 
ory of some robust estimates, and contains proofs of several results stated in the text. 
Chapter 11 is an appendix containing descriptions of most data sets used in the book. 
Chapter 12 contains detailed instructions on the use of robust procedures written in 
S-PLUS. 

All methods are introduced with the help of examples with real data. The problems
at the end of each chapter consist of both theoretical derivations and analysis of
other real data sets.


How to read this book 


Each chapter can be read at two levels. The main part of the chapter explains the 
models to be tackled and the robust methods to be used, comparing their advantages 
and shortcomings through examples and avoiding technicalities as much as possible. 
Readers whose main interest is in applications should read enough of each chapter to 
understand which is the currently preferred method, and the reasons it is preferred. 
The theoretically oriented reader can find proofs and other mathematical details in 
appendices and in Chapter 9 and Chapter 10. Sections marked with an asterisk may 
be skipped at first reading. 


Computing 


A great advantage of classical methods is that they require only computational proce- 
dures based on well-established numerical linear algebra methods which are gener- 
ally quite fast algorithms. On the other hand computing robust estimates requires 
solving highly nonlinear optimization problems that typically involve a dramatic 
increase in computational complexity and running time. Most current robust meth- 
ods would be unthinkable without the power of today’s standard personal computers. 
Fortunately computers continue getting faster, have larger memory and are cheaper, 
which is good for the future of robust statistics. 
Since the behavior of a robust procedure may depend crucially on the algorithm
used, the book devotes considerable attention to algorithmic details for all the meth- 
ods proposed. At the same time in order that robust statistics be widely accepted by a 
wide range of users, the methods need to be readily available in commercial software. 
Robust methods have been implemented in several available commercial statistical 
packages, including S-PLUS and SAS. In addition many robust procedures have been 
implemented in the public-domain language R, which is similar to S. References for 
free software for robust methods are given at the end of Chapter 11. We have focused 
on S-PLUS because it offers the widest range of methods, and because the methods 
are accessible from a user-friendly menu and dialog user interface as well as from 
the command line. 

For each method in the book, instructions are given on how to compute it using 
S-PLUS in Chapter 11. For each example, the book gives the reference to the respec- 
tive dataset and the S-PLUS code that allow the reader to reproduce the example. 
Datasets and codes are to be found in the book’s web site
http://www.wiley.com/go/robuststatistics. This site will also contain corrections to any
errata we subsequently
discover, and clarifying comments and suggestions as needed. The authors will appre- 
ciate any feedback from readers that will result in posting additional helpful material 
on the web site. 


S-PLUS software download 


A time-limited version of S-PLUS for Windows software, that expires after 150 days, 
is being provided by Insightful for this book. To download and install the S-PLUS
software, follow the instructions at
http://www.insightful.com/support/splusbooks/robstats.

To access the web page, the reader must provide a password. The password is the 
web registration key provided with this book as a sticker on the inside back cover. In 
order to activate S-PLUS for Windows the reader must use the web registration key. 


Introduction 


1.1 Classical and robust approaches to statistics


This introductory chapter is an informal overview of the main issues to be treated in 
detail in the rest of the book. Its main aim is to present a collection of examples that 
illustrate the following facts: 


• Data collected in a broad range of applications frequently contain one or more atypical
observations, known as outliers; that is, observations that are well-separated
from the majority or “bulk” of the data, or in some way deviate from the general
pattern of the data.

• Classical estimates, such as the sample mean, the sample variance, sample
covariances and correlations, or the least-squares fit of a regression model, can be
adversely influenced by outliers, even by a single one, and therefore often fail to
provide good fits to the bulk of the data.

• There exist robust parameter estimates that provide a good fit to the bulk of the
data when the data contains outliers, as well as when the data is free of them. A
direct benefit of a good fit to the bulk of data is the reliable detection of outliers,
particularly in the case of multivariate data.


In Chapter 3 we shall provide some formal probability-based concepts and defini- 
tions of robust statistics. Meanwhile, it is important to be aware of the following per- 
formance distinctions between classical and robust statistics at the outset. Classical 
statistical inference quantities such as confidence intervals, t-statistics and p-values,
R² values and model selection criteria in regression can be adversely influenced by
the presence of even one outlier in the data. In contrast, appropriately constructed
robust versions of those inference quantities are little influenced by outliers. Point
estimate predictions and their confidence intervals based on classical statistics can be 
spoiled by outliers, while predictive models fitted using robust statistics do not suffer 
from this disadvantage. 

It would, however, be misleading to always think of outliers as “bad” data. 
They may well contain unexpected, but relevant information. According to Kandel 
(1991, p. 110): 


The discovery of the ozone hole was announced in 1985 by a British team working on 
the ground with “conventional” instruments and examining its observations in detail. 
Only later, after reexamining the data transmitted by the TOMS instrument on NASA’s 
Nimbus 7 satellite, was it found that the hole had been forming for several years. Why 
had nobody noticed it? The reason was simple: the systems processing the TOMS data, 
designed in accordance with predictions derived from models, which in turn were estab- 
lished on the basis of what was thought to be “reasonable”, had rejected the very (“exces- 
sively”) low values observed above the Antarctic during the Southern spring. As far as 
the program was concerned, there must have been an operating defect in the instrument. 


In the next sections we present examples of classical and robust estimates of the 
mean, standard deviation, correlation and linear regression for data containing out- 
liers. Except in Section 1.2, we do not describe the robust estimates in any detail, and 
return to their definitions in later chapters. 


1.2 Mean and standard deviation 


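Let $x = (x_1, x_2, \ldots, x_n)$ be a set of observed values. The sample mean $\bar{x}$ and sample
standard deviation (SD) $s$ are defined by

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{1.1}
\]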


The sample mean is just the arithmetic average of the data, and as such one might
expect that it would provide a good estimate of the center or location of the data. Likewise,
one might expect that the sample SD would provide a good estimate of the
dispersion of the data. Now we shall see how much influence a single outlier can 
have on these classical estimates. 


Example 1.1 Consider the following 24 determinations of the copper content in 
wholemeal flour (in parts per million), sorted in ascending order (Analytical Methods 
Committee, 1989): 


2.20  2.20  2.40  2.40  2.50  2.70  2.80  2.90
3.03  3.03  3.10  3.37  3.40  3.40  3.40  3.50
3.60  3.70  3.70  3.70  3.70  3.77  5.28  28.95


The value 28.95 immediately stands out from the rest of the values and would
be considered an outlier by almost anyone. One might conjecture that this
inordinately large value was caused by a misplaced decimal point with respect to
a “true” value of 2.895. In any event, it is a highly influential outlier, as we now
demonstrate.

Figure 1.1 Copper content of flour data with sample mean and sample median
estimates. Downward-pointing arrows mark the sample median and sample mean with
the outlier; upward-pointing arrows mark the sample mean and sample median without
the outlier. The outlier at 28.95 lies beyond the plotted axis (2.00 to 7.00).

The values of the sample mean and SD for the above dataset are $\bar{x} = 4.28$ and
$s = 5.30$, respectively. Since $\bar{x} = 4.28$ is larger than all but two of the data values,
it is not among the bulk of the observations and as such does not represent a good
estimate of the center of the data. If one deletes the suspicious value of 28.95, then
the values of the sample mean and sample SD are changed to $\bar{x} = 3.21$ and $s = 0.69$.
Now the sample mean does provide a good estimate of the center of the data, as
is clearly shown in Figure 1.1, and the SD is over seven times smaller than it was
with the outlier present. See the leftmost upward-pointing arrow and the rightmost
downward-pointing arrow in Figure 1.1.
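
The figures quoted in this example can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the variable and function names are illustrative choices, and `statistics.stdev` uses the divisor n - 1, in agreement with (1.1).

```python
import statistics

# Copper content in wholemeal flour (parts per million), Example 1.1
flour = [
    2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
    3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
    3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95,
]


def mean_and_sd(values):
    """Return the sample mean and the sample SD (divisor n - 1)."""
    return statistics.mean(values), statistics.stdev(values)


mean_all, sd_all = mean_and_sd(flour)
mean_clean, sd_clean = mean_and_sd(flour[:-1])  # drop the value 28.95

print(f"with outlier:    mean = {mean_all:.2f}, SD = {sd_all:.2f}")
print(f"without outlier: mean = {mean_clean:.2f}, SD = {sd_clean:.2f}")
print(f"ratio of SDs:    {sd_all / sd_clean:.1f}")
```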

Let us consider how much influence a single outlier can have on the sample mean
and sample SD. For example, suppose that the value 28.95 is replaced by an arbitrary
value $x$ for the 24th observation, $x_{24}$. It is clear from the definition of the
sample mean that by varying $x$ from $-\infty$ to $+\infty$ the value of the sample mean
changes from $-\infty$ to $+\infty$. It is an easy exercise to verify that as $x$ ranges
from $-\infty$ to $+\infty$, the sample SD ranges from some positive value smaller than
that based on the first 23 observations to $+\infty$. Thus we can say that a single
outlier has an unbounded influence on these two classical statistics.
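A small numerical experiment makes the unbounded influence concrete. The sketch below (the helper name `mean_sd_with` is hypothetical) recomputes the mean and SD while the 24th observation is moved over a wide range, and also evaluates the smallest SD attainable, which occurs when the 24th value equals the mean of the first 23.

```python
from statistics import mean, stdev

# First 23 flour observations of Example 1.1 (the value 28.95 is left out).
first23 = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
           3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
           3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28]

def mean_sd_with(x):
    """Sample mean and SD after appending x as the 24th observation."""
    sample = first23 + [x]
    return mean(sample), stdev(sample)

for x in (-1e6, -100.0, 2.895, 28.95, 100.0, 1e6):
    m, s = mean_sd_with(x)
    print(f"x = {x:>10g}: mean = {m:12.3f}, SD = {s:12.3f}")

# The sum of squared deviations grows with (x - mean of first 23)^2, so the SD
# of the 24 values is smallest when x equals the mean of the first 23.
sd_min = mean_sd_with(mean(first23))[1]
print(f"SD of first 23 = {stdev(first23):.3f}, smallest SD with 24 values = {sd_min:.3f}")
```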

An outlier may have a serious adverse influence on confidence intervals. For
the flour data, the classical interval based on the $t$-distribution with confidence
level 0.95 is (2.05, 6.51); after removing the outlier, the interval is (2.91, 3.51).
The impact of the single outlier has been to considerably lengthen the interval in an
asymmetric way.
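The two intervals can be reproduced, up to rounding in the last digit, with the classical formula $\bar{x} \pm t_{n-1,\,0.975}\, s/\sqrt{n}$. In the sketch below the Student $t$ quantiles are hard-coded from standard tables (an assumption of this example, since the Python standard library does not provide them), and `t_interval` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean, stdev

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
         3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
         3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

# 0.975 quantiles of the Student t distribution, keyed by degrees of freedom.
# Assumption: values copied from standard tables, not computed here.
T_975 = {22: 2.0739, 23: 2.0687}

def t_interval(data):
    """Classical 95% confidence interval for the mean: mean +/- t * s / sqrt(n)."""
    n = len(data)
    half_width = T_975[n - 1] * stdev(data) / sqrt(n)
    center = mean(data)
    return center - half_width, center + half_width

lo_all, hi_all = t_interval(flour)
lo_clean, hi_clean = t_interval(flour[:-1])
print(f"with outlier:    ({lo_all:.2f}, {hi_all:.2f})")
print(f"without outlier: ({lo_clean:.2f}, {hi_clean:.2f})")
```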

This example suggests that a simple way to handle outliers is to detect them
and remove them from the dataset. There are many methods for detecting outliers
(see, for example, Barnett and Lewis, 1998). Deleting an outlier, although better than
doing nothing, still poses a number of problems:


• When is deletion justified? Deletion requires a subjective decision. When is an
  observation “outlying enough” to be deleted?

• The user or the author of the data may think that “an observation is an
  observation” (i.e., observations should speak for themselves) and hence feel uneasy
  about deleting them.

• Since there is generally some uncertainty as to whether an observation is really
  atypical, there is a risk of deleting “good” observations, which would result in
  underestimating data variability.

• Since the results depend on the user’s subjective decisions, it is difficult to
  determine the statistical behavior of the complete procedure.


We are thus led to another approach: why use the sample mean and SD? Might there
be better possibilities?

One very old method for estimating the “middle” of the data is to use the sample
median. Any number $t$ such that the numbers of observations on both sides of it
are equal is called a median of the dataset: $t$ is a median of the dataset
$\mathbf{x} = (x_1, \ldots, x_n)$, and will be denoted by

$$t = \mathrm{Med}(\mathbf{x}), \quad \text{if } \#\{x_i > t\} = \#\{x_i < t\},$$

where $\#\{A\}$ denotes the number of elements of the set $A$. It is convenient to
define the sample median in terms of the order statistics
$(x_{(1)}, x_{(2)}, \ldots, x_{(n)})$, obtained by sorting the observations
$\mathbf{x} = (x_1, \ldots, x_n)$ in increasing order so that

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}.$$

If $n$ is odd, then $n = 2m - 1$ for some integer $m$, and in that case
$\mathrm{Med}(\mathbf{x}) = x_{(m)}$. If $n$ is even, then $n = 2m$ for some integer $m$,
and then any value between $x_{(m)}$ and $x_{(m+1)}$ satisfies the definition of a
sample median, and it is customary to take

$$\mathrm{Med}(\mathbf{x}) = \frac{x_{(m)} + x_{(m+1)}}{2}.$$

However, in some cases (e.g. in Section 4.5.1) it may be more convenient to choose
$x_{(m)}$ or $x_{(m+1)}$ (“low” and “high” medians, respectively).

The mean and the median are approximately equal if the sample is symmetri- 
cally distributed about its center, but not necessarily otherwise. In our example, the 
median of the whole sample is 3.38, while the median without the largest value is 3.37, 
showing that the median is not much affected by the presence of this value. See the 
locations of the sample median with and without the outlier present in Figure 1.1 
above. Notice that for this sample, the value of the sample median with the outlier 
present is relatively close to the sample mean value of 3.21 with the outlier deleted. 


Suppose again that the value 28.95 is replaced by an arbitrary value $x$ for the 24th
observation $x_{24}$. It is clear from the definition of the sample median that when $x$
ranges from $-\infty$ to $+\infty$ the value of the sample median does not change from
$-\infty$ to $+\infty$ as was the case for the sample mean. Instead, when $x$ goes to
$-\infty$ the sample median undergoes the small change from 3.38 to 3.23 (the latter
being the average of $x_{(11)} = 3.10$ and $x_{(12)} = 3.37$ in the original dataset);
when $x$ goes to $+\infty$ the sample median goes to the value 3.38 given above for the
original data. Since the sample median fits the bulk of the data well, with or without
the outlier, and is not much influenced by it, it is a good robust alternative to the
sample mean.
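The bounded behavior of the median can be verified numerically. The sketch below uses `statistics.median`, which follows the convention of averaging the two middle order statistics when $n$ is even; the extreme values $\pm 10^{12}$ merely stand in for $x \to \pm\infty$. The unrounded medians are 3.385 and 3.235, which the text reports as 3.38 and 3.23.

```python
from statistics import median

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
         3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
         3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]
first23 = flour[:-1]

med_all = median(flour)                 # average of the 12th and 13th order statistics
med_clean = median(first23)             # 12th order statistic of the remaining 23 values
med_low = median(first23 + [-1e12])     # 24th observation pushed far to the left
med_high = median(first23 + [1e12])     # 24th observation pushed far to the right

print(f"median, full sample:       {med_all:.3f}")
print(f"median, outlier deleted:   {med_clean:.3f}")
print(f"median, x very negative:   {med_low:.3f}")
print(f"median, x very large:      {med_high:.3f}")
```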

Likewise, one robust alternative to the SD is the median absolute deviation about
the median (MAD), defined as

$$\mathrm{MAD}(\mathbf{x}) = \mathrm{MAD}(x_1, x_2, \ldots, x_n) = \mathrm{Med}\{|\mathbf{x} - \mathrm{Med}(\mathbf{x})|\}.$$

This estimator uses the sample median twice, first to get an estimate of the center
of the data in order to form the set of absolute residuals about the sample median,
$\{|\mathbf{x} - \mathrm{Med}(\mathbf{x})|\}$, and then to compute the sample median of
these absolute residuals. To make the MAD comparable to the SD, we define the
normalized MAD (MADN) as

$$\mathrm{MADN}(\mathbf{x}) = \frac{\mathrm{MAD}(\mathbf{x})}{0.6745}.$$

The reason for this definition is that 0.6745 is the MAD of a standard normal random
variable, and hence a $N(\mu, \sigma^2)$ variable has $\mathrm{MADN} = \sigma$.

For the above dataset, one gets MADN = 0.53, as compared with s = 5.30. 
Deleting the large outlier yields MADN = 0.50, as compared to the somewhat higher 
sample SD value of s = 0.69. The MAD is clearly not influenced very much by the 
presence of a large outlier, and as such provides a good robust alternative to the 
sample SD. 
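As a check on these values, here is a short sketch of the MAD and MADN computations; the function names `mad` and `madn` are illustrative. The constant 0.6745 is also compared with the third quartile of the standard normal distribution, which equals the MAD of a standard normal variable.

```python
from statistics import NormalDist, median

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
         3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
         3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

def mad(x):
    """Median absolute deviation about the median."""
    med = median(x)
    return median([abs(v - med) for v in x])

def madn(x):
    """Normalized MAD, comparable to the SD under normality."""
    return mad(x) / 0.6745

# For a standard normal variable, the median of |x| is the 0.75 quantile.
q75 = NormalDist().inv_cdf(0.75)

print(f"MADN with outlier:    {madn(flour):.2f}")
print(f"MADN without outlier: {madn(flour[:-1]):.2f}")
print(f"0.75 quantile of N(0, 1): {q75:.4f}")
```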

So why not always use the median and MAD? An informal explanation is that 
if the data contain no outliers, these estimates have a statistical performance that is 
poorer than that of the classical estimates $\bar{x}$ and $s$. The ideal solution would be to
have “the best of both worlds”: estimates that behave like the classical ones when 
the data contain no outliers, but are insensitive to outliers otherwise. This is the 
data-oriented idea of robust estimation. A more formal notion of robust estimation 
based on statistical models, which will be discussed in the following chapters, is that 
the statistician always has a statistical model in mind (explicitly or implicitly) when 
analyzing data, for example a model based on a normal distribution or some other ide- 
alized parametric model such as an exponential distribution. The classical estimates 
are in some sense “optimal” when the data are exactly distributed according to the 
assumed model, but can be very suboptimal when the distribution of the data differs 
from the assumed model by a “small” amount. Robust estimates on the other hand 
maintain approximately optimal performance, not just under the assumed model, but 
under “small” perturbations of it too. 


1.3 The “three sigma edit” rule


A traditional measure of the outlyingness of an observation $x_i$ with respect to a
sample is the ratio between its distance to the sample mean and the sample SD:

$$t_i = \frac{x_i - \bar{x}}{s}. \qquad (1.3)$$

Observations with $|t_i| > 3$ are traditionally deemed suspicious (the “three-sigma
rule”), based on the fact that they would be “very unlikely” under normality, since
$P(|x| > 3) = 0.003$ for a random variable $x$ with a standard normal distribution.
The largest observation in the flour data has $t_i = 4.65$, and so is suspicious.
Traditional “three-sigma edit” rules result in either discarding observations for
which $|t_i| > 3$, or adjusting them to one of the values $\bar{x} \pm 3s$, whichever is
nearer. Despite its long tradition, this rule has some drawbacks that deserve to be
taken into account:


• In a very large sample of “good” data, some observations will be declared
  suspicious and be altered. More precisely, in a large normal sample, about three
  observations out of 1000 will have $|t_i| > 3$. For this reason, normal Q-Q plots are
  more reliable for detecting outliers (see example below).

• In very small samples the rule is ineffective: it can be shown that

  $$|t_i| \le \frac{n-1}{\sqrt{n}}$$

  for all possible data sample values, and hence if $n \le 10$ then always $|t_i| < 3$.
  The proof is left to the reader (Problem 1.3).

• When there are several outliers, their effects may interact in such a way that
  some or all of them remain unnoticed (an effect called masking), as the following
  example shows.


Example 1.2 The following data (Stigler 1977) are 20 determinations of the time
(in microseconds) needed for the light to travel a distance of 7442 m. The actual times
are the table values × 0.001 + 24.8.


28  26  33  24  34  -44  27  16  40  -2
29  22  24  21  25   30  23  29  31  19


The normal Q-Q plot in Figure 1.2 reveals the two lowest observations
($-44$ and $-2$) as suspicious. Their respective $t_i$s are $-3.73$ and $-1.35$, and so
the value of $|t_i|$ for the observation $-2$ does not indicate that it is an outlier.
The reason that $-2$ has such a small $|t_i|$ value is that both observations pull
$\bar{x}$ to the left and inflate $s$; it is said that the value $-44$ “masks” the
value $-2$.

Figure 1.2 Velocity of light: Q-Q plot of observed times

To avoid this drawback it is better to replace $\bar{x}$ and $s$ in (1.3) by robust
location and dispersion measures. A robust version of (1.3) can be defined by replacing
the sample mean and SD by the median and MADN, respectively:

$$t_i' = \frac{x_i - \mathrm{Med}(\mathbf{x})}{\mathrm{MADN}(\mathbf{x})}. \qquad (1.4)$$

The $t_i'$s for the two leftmost observations are now $-11.73$ and $-4.64$, and hence
the three-sigma edit rule, with $t'$ instead of $t$, pinpoints both as suspicious. This
suggests that even if we only want to detect outliers, rather than to estimate
parameters, detection procedures based on robust estimates are more reliable.
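The masking effect in Example 1.2 can be reproduced with a few lines of code. This is a minimal sketch with illustrative function names; the robust scores are computed with the constant 0.6745 and agree with the values quoted in the text up to rounding in the last digit.

```python
from statistics import mean, median, stdev

# Table values of Example 1.2 (coded light travel times).
light = [28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
         29, 22, 24, 21, 25, 30, 23, 29, 31, 19]

def classical_t(x):
    """Scores t_i = (x_i - mean) / SD."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]

def robust_t(x):
    """Scores t'_i = (x_i - Med) / MADN."""
    med = median(x)
    madn = median([abs(v - med) for v in x]) / 0.6745
    return [(v - med) / madn for v in x]

t = classical_t(light)
t_rob = robust_t(light)

# Observations flagged by the three-sigma rule under each scoring.
flag_classical = [v for v, ti in zip(light, t) if abs(ti) > 3]
flag_robust = [v for v, ti in zip(light, t_rob) if abs(ti) > 3]

print("flagged with mean and SD:     ", flag_classical)
print("flagged with median and MADN: ", flag_robust)
```

With the classical scores only the most extreme value is flagged, while the robust scores flag both low observations.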

A simple robust location estimate could be defined by deleting all observations
with $|t_i'|$ larger than a given value, and taking the average of the rest. While
this procedure is better than the three-sigma edit rule based on $t$, it will be seen
in Chapter 3 that the estimates proposed in this book handle the data more
smoothly, and can be tuned to have certain desirable robustness properties that this
procedure lacks.


1.4 Linear regression


1.4.1 Straight-line regression 


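First consider fitting a straight-line regression model to the dataset
$\{(x_i, y_i) : i = 1, \ldots, n\}$:

$$y_i = \alpha + x_i\beta + u_i, \quad i = 1, \ldots, n,$$

where $x_i$ and $y_i$ are the predictor and response variable values, respectively, and
$u_i$ are random errors. The time-honored classical way of fitting this model is to
estimate the parameters $\alpha$ and $\beta$ with the least-squares (LS) estimates

$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \bar{x}\hat{\beta}.$$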
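To see how the LS formulas react to a couple of low outliers, the sketch below fits a line to a small hypothetical dataset (invented for illustration; it is not the EPS data) in which two responses are shifted downward. For comparison it also computes a Theil-Sen line (median of pairwise slopes), which is used here only as a simple stand-in for a robust fit and is not the MM-estimate discussed in this book.

```python
from itertools import combinations
from statistics import mean, median

def ls_fit(x, y):
    """Least-squares intercept and slope for a straight line."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta = sxy / sxx
    return ybar - xbar * beta, beta

def theil_sen_fit(x, y):
    """Median of pairwise slopes, then median residual as intercept."""
    pairs = list(zip(x, y))
    slopes = [(yj - yi) / (xj - xi)
              for (xi, yi), (xj, yj) in combinations(pairs, 2) if xj != xi]
    beta = median(slopes)
    alpha = median([yi - beta * xi for xi, yi in pairs])
    return alpha, beta

# Hypothetical data: exact line y = 2 + 0.5 x, with two responses lowered by 5.
x = list(range(1, 11))
y = [2.0 + 0.5 * xi for xi in x]
y[7] -= 5.0
y[8] -= 5.0

a_ls, b_ls = ls_fit(x, y)
a_ts, b_ts = theil_sen_fit(x, y)
print(f"LS fit:        intercept = {a_ls:.3f}, slope = {b_ls:.3f}")
print(f"Theil-Sen fit: intercept = {a_ts:.3f}, slope = {b_ts:.3f}")
```

In this constructed example the LS slope is pulled well below 0.5 by the two shifted points, while the median-based line recovers the line followed by the other eight points.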


As an example of how influential two outliers can be on these estimates,
Figure 1.3 plots the earnings per share (EPS) versus time each year for a company
with the stock exchange ticker symbol INVENSYS, along with the straight-line fits
of the LS estimate and of a robust regression estimate (called an MM-estimate) that
has desirable theoretical properties (to be described in detail in Chapter 5).

Figure 1.3 EPS data with robust and LS fits

The two unusually low EPS values in 1997 and 1998 cause the LS line to fit the
data very poorly, and one would not expect the line to provide a good prediction of
EPS in 2001. By way of contrast, the robust line fits the bulk of the data well, and
should provide a reasonable prediction of EPS in 2001.

The above EPS example was brought to the attention of one of the authors by an analyst
in the corporate finance department of a well-known large company. The analyst was
required to produce a prediction of next year’s EPS for several hundred companies, 
and at first he used the LS fit for this purpose. But then he noticed a number of firms for 
which the data contained outliers that distorted the LS parameter estimates, resulting 
in a very poor fit and a poor prediction of next year’s EPS. Once he discovered the 
robust estimate, and found that it gave him essentially the same results as the LS 
estimate when the data contained no outliers, while at the same time providing a 
better fit and prediction than LS when outliers were present, he began routinely using 
the robust estimate for his task. 

It is important to note that automatically flagging large differences between a 
classical estimate (in this case LS) and a robust estimate provides a useful diagnostic 
alert that outliers may be influencing the LS result. 


1.4.2 Multiple linear regression


Now consider fitting a multiple linear regression model

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + u_i, \quad i = 1, \ldots, n,$$

where the response variable values are $y_i$, and there are $p$ predictor variables
$x_{ij}$, $j = 1, \ldots, p$, and $p$ regression coefficients $\beta_j$. Not surprisingly,
outliers can also have an adverse influence on the LS estimate $\hat{\boldsymbol\beta}$
for this general linear model, a fact which is illustrated by the following example
that appears in Hubert and Rousseeuw (1997).


Example 1.3 The response variable values $y_i$ are the rates of unemployment in
various geographical regions around Hannover, Germany, and the predictor variables
$x_{ij}$, $j = 1, \ldots, p$, are as follows:


• PA: percentage engaged in production activities

• GPA: growth in PA

• HS: percentage engaged in higher services

• GHS: growth in HS

• Region: geographical region around Hannover (21 regions)

• Period: time period (three periods: 1979-82, 1983-88, 1989-92)


Note that the categorical variables Region and Period require 20 and 2 parameters
respectively, so that, including an intercept, the model has 27 parameters,
and the number of response observations is 63, one for each region and period.
Figures 1.4 and 1.5 show the results of LS and robust fitting in a manner that
facilitates easy comparison of the results. The robust fitting is done by a special
“M-estimate” that has desirable theoretical properties, and is described in detail in
Section 5.7.5.

Figure 1.4 Standardized residuals for LS and robust fits (standardized residuals
plotted against observation index, i.e. time)

Figure 1.5 Normal Q-Q plots for (left) LS and (right) robust fits

For a set of estimated parameters $(\hat\beta_1, \ldots, \hat\beta_p)$, with fitted values
$\hat{y}_i = \sum_{j=1}^{p} x_{ij}\hat\beta_j$, residuals $\hat{u}_i = y_i - \hat{y}_i$ and
residuals dispersion estimate $\hat\sigma$, Figure 1.4 shows the standardized residuals
$\tilde{u}_i = \hat{u}_i/\hat\sigma$ plotted versus the observations’ index values $i$.
Standardized residuals that fall outside the horizontal dashed lines at $\pm 2.33$, which
occurs with probability 0.02 under normality, are declared suspicious. The display for
the LS fit does not reveal any outliers, while that for the robust fit clearly reveals
10 to 12 outliers among 63 observations. This is because the robust regression has found
a linear relationship that fits the majority of the data points well, and consequently
is able to reliably identify the outliers. The LS estimate instead attempts to fit all
data points and so is heavily influenced by the outliers. The fact that all of the LS
standardized residuals lie inside the horizontal dashed lines is because the outliers
have inflated the value of $\hat\sigma$ computed in the classical way based on the sum
of squared residuals, while a robust estimate $\hat\sigma$ used for the robust regression
is not much influenced by the outliers.
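The probability attached to the $\pm 2.33$ cutoff is easy to confirm, and it gives a sense of how many residuals one would expect to fall outside the lines by chance alone. The sketch below assumes exactly standard normal standardized residuals, which is an idealization; `two_sided_tail` is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def two_sided_tail(c):
    """P(|Z| > c) for a standard normal Z."""
    return 2.0 * (1.0 - std_normal.cdf(c))

p_233 = two_sided_tail(2.33)
expected_flags = 63 * p_233  # expected count among 63 normal residuals

print(f"P(|Z| > 2.33) = {p_233:.4f}")
print(f"expected number outside the lines among 63 residuals: {expected_flags:.2f}")
```

Under this idealization only about one residual in 63 would be expected outside the lines, far fewer than the 10 to 12 revealed by the robust fit.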

Figure 1.5 shows normal Q-Q plots of the residuals for the LS and robust fits, 
with light dotted lines showing the 95% simulated pointwise confidence regions to 
allow an assessment of whether or not there are significant outliers and potential 
nonnormality. These plots may be interpreted as follows. If the data fall along the 
straight line (which itself is fitted by a robust method) with no points outside the 95% 
confidence region, then one is moderately sure that the data are normally distributed. 

Performing only the LS fit, and therefore looking only at the normal Q-Q plot in 
the left-hand plot in Figure 1.5, would lead to the conclusion that the residuals are 
indeed quite normally distributed, with no outliers. The normal Q-Q plot of residu- 
als for the robust fit in the right-hand panel of Figure 1.5 clearly shows that such a 
conclusion is wrong. This plot shows that the bulk of the residuals are indeed quite 
normally distributed, as evidenced by the compact linear behavior in the middle of 
the plot. At the same time, it clearly reveals the outliers that were evident in the plot 
of standardized residuals (Figure 1.4).


1.5 Correlation coefficients


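Let $\{(x_i, y_i)\}$, $i = 1, \ldots, n$, be a bivariate sample. The most popular measure
of association between the $x_i$ and the $y_i$ is the sample correlation coefficient,
defined as

$$\hat\rho = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{1/2}\left(\sum_{i=1}^{n}(y_i - \bar{y})^2\right)^{1/2}},$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$ and $y_i$.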

The sample correlation coefficient is highly sensitive to the presence of outliers. 
Figure 1.6 shows a scatterplot of the increase (gain) in numbers of telephones versus 
the annual change in new housing starts, for a period of 15 years in a geographical 
region within New York City in the 1960s and 1970s, in coded units. 

There are two outliers in this bivariate (two-dimensional) dataset that are clearly
separated from the rest of the data. It is important to notice that these two outliers
are not one-dimensional outliers; they are not even the largest or smallest values
in either of the two coordinates. This observation illustrates an extremely important
point: two-dimensional outliers cannot be reliably detected by examining the values
of bivariate data one-dimensionally; that is, one variable at a time.

Figure 1.6 Increase in numbers of telephones versus difference in new
housing starts

The value of the sample correlation coefficient for the complete gain data is
$\hat\rho = 0.44$, and deleting the two outliers yields $\hat\rho = 0.91$, which is quite
a large difference and in the range of what an experienced user might expect for the
dataset with the two outliers removed. The dataset with the two outliers deleted can be
seen as roughly elliptical, with a major axis sloping up and to the right and the minor
axis sloping up and to the left. With this picture in mind one can see that the two
outliers lie in the minor axis direction, though offset somewhat from the minor axis.
The impact of the outliers is to decrease the value of the sample correlation
coefficient by the considerable amount of 0.47 from the value of 0.91 it has with the
two outliers deleted. This illustrates a general biasing effect of outliers on the
sample correlation coefficient: outliers that lie along the minor axis direction of
data that is otherwise positively correlated negatively influence the sample correlation
coefficient. Similarly, outliers that lie along the minor axis direction of data that is
otherwise negatively correlated will increase the sample correlation coefficient.
Outliers that lie along a major axis direction of the rest of the data will increase the
absolute value of the sample correlation coefficient, making it more positive if the
bulk of the data is positively correlated.
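The biasing effect of minor-axis outliers can be illustrated on a small constructed example. The data below are hypothetical (they are not the telephone data): ten points lying close to a rising line, plus two points placed along the minor-axis direction whose coordinates stay within the range of the bulk in each variable, so that neither is a one-dimensional outlier.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Sample correlation coefficient."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical bulk of the data, strongly positively correlated.
x_bulk = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_bulk = [1.2, 1.8, 3.1, 4.2, 4.9, 6.1, 6.8, 8.2, 9.1, 9.9]

# Two hypothetical outliers along the minor-axis direction.
outliers = [(2.5, 8.5), (8.5, 2.5)]
x_all = x_bulk + [p[0] for p in outliers]
y_all = y_bulk + [p[1] for p in outliers]

r_clean = pearson(x_bulk, y_bulk)
r_all = pearson(x_all, y_all)
print(f"correlation without outliers: {r_clean:.2f}")
print(f"correlation with outliers:    {r_all:.2f}")
```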

If one uses a robust correlation coefficient estimate it will not make much difference
whether the outliers in the main-gain data are present or deleted. Using a good
robust method $\hat\rho_{\mathrm{Rob}}$ for estimating covariances and correlations on the
main-gain data yields $\hat\rho_{\mathrm{Rob}} = 0.85$ for the entire dataset and
$\hat\rho_{\mathrm{Rob}} = 0.90$ with the two outliers deleted. For the robust correlation
coefficient, the change due to deleting the outliers is only 0.05, compared to 0.47 for
the classical estimate. A detailed description of robust correlation and covariance
estimates is provided in Chapter 6.

When there are more than two variables, examining all pairwise scatterplots for 
outliers is hopeless unless the number of variables is relatively small. But even look- 
ing at all scatterplots or applying a robust correlation estimate to all pairs does not 
suffice, for in the same way that there are bivariate outliers that do not stand out in 
any univariate representation, there may be multivariate outliers that heavily influence 
the correlations and do not stand out in any bivariate scatterplot. Robust methods deal 
with this problem by estimating all the correlations simultaneously, in such a man- 
ner that points far away from the bulk of the data are automatically downweighted. 
Chapter 6 considers these methods in detail. 


1.6 Other parametric models 


We do not want to leave the reader with the impression that robust estimation is only 
concerned with outliers in the context of an assumed normal distribution model. Out- 
liers can cause problems in fitting other simple parametric distributions such as an 
exponential, Weibull or gamma distribution, where the classical approach is to use 
a nonrobust maximum likelihood estimate (MLE) for the assumed model. In these 
cases one needs robust alternatives to the MLE in order to obtain a good fit to the 
bulk of the data. 
For example, the exponential distribution with density

$$f(x; \lambda) = \frac{1}{\lambda}e^{-x/\lambda}, \quad x \ge 0,$$

is widely used to model random inter-arrival and failure times, and it also arises in
the context of time-series spectral analysis (see Section 8.14). It is easily shown
that the parameter $\lambda$ is the expected value of the random variable $x$ (in other
words, $\lambda = \mathrm{E}(x)$) and that the sample mean is the MLE. We already know from
the previous discussion that the sample mean lacks robustness and can be greatly
influenced by outliers. In this case the data are nonnegative so one is only concerned
about large positive outliers that cause the value of the sample mean to be inflated in
a positive direction. So we need a robust alternative to the sample mean, and one
naturally considers use of the sample median $\mathrm{Med}(\mathbf{x})$. It turns out that
the sample median is an inconsistent estimate of $\lambda$: it does not approach
$\lambda$ when the sample size increases, and hence a correction is needed. It is an easy
calculation to check that the median of the exponential distribution has value
$\lambda \log 2$, where log stands for natural logarithm, and so one can use
$\mathrm{Med}(\mathbf{x})/\log 2$ as a simple robust estimate of $\lambda$ that is consistent
with the assumed model. This estimate turns out to have desirable robustness properties,
as described in Problem 3.15.
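A deterministic illustration of the adjusted median is sketched below. Instead of random draws, it uses an idealized sample made of exponential quantiles at the levels $(i - 0.5)/n$ (an assumption made so that the result is reproducible), with a hypothetical value $\lambda = 2$, and then replaces the five largest values by gross outliers.

```python
from math import log
from statistics import mean, median

lam = 2.0   # hypothetical true value of the exponential mean
n = 101

# Idealized sample: exponential quantiles -lam * log(1 - p) at p = (i - 0.5) / n.
sample = [-lam * log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]

def adjusted_median(x):
    """Med(x) / log 2, consistent for lambda under the exponential model."""
    return median(x) / log(2.0)

# Replace the five largest values by gross outliers.
contaminated = sample[:-5] + [1000.0] * 5

print(f"clean:        mean = {mean(sample):.3f}, Med/log2 = {adjusted_median(sample):.3f}")
print(f"contaminated: mean = {mean(contaminated):.3f}, "
      f"Med/log2 = {adjusted_median(contaminated):.3f}")
```

The sample mean is inflated by the outliers, while the adjusted median is unchanged because the middle order statistic is not affected.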

The methods of robustly fitting Weibull and gamma distributions are much more
complicated than the above use of the adjusted median for the exponential
distribution. We present one important application of robust fitting of a gamma
distribution due to Marazzi et al. (1998). The gamma distribution has density

$$f(x; \alpha, \sigma) = \frac{1}{\Gamma(\alpha)\sigma^{\alpha}}\,x^{\alpha-1}e^{-x/\sigma}, \quad x \ge 0,$$

and the mean of this distribution is known to be $\mathrm{E}(x) = \alpha\sigma$. The problem
has to do with estimating the length of stay (LOS) of 315 patients in a hospital. The
mean LOS is a quantity of considerable economic importance, and some patients whose
hospital stays are much longer than those of the majority of the patients adversely
influence the MLE fit of the gamma distribution. The MLE values turn out to be
$\hat\alpha_{\mathrm{MLE}} = 0.93$ and $\hat\sigma_{\mathrm{MLE}} = 8.50$, while the robust
estimates are $\hat\alpha_{\mathrm{Rob}} = 1.39$ and $\hat\sigma_{\mathrm{Rob}} = 3.64$, and
the resulting mean LOS estimates are $\hat\mu_{\mathrm{MLE}} = 7.87$ and
$\hat\mu_{\mathrm{Rob}} = 4.97$. Some patients with unusually long LOS values contribute to
an inflated estimate of the mean LOS for the majority of the patients. A more complete
picture is obtained through the figures below.

Figure 1.7 shows a histogram of the data along with the MLE and robust gamma 
density fit to the LOS data. The MLE underestimates the density for small values 
of LOS and overestimates the density for large values of LOS, thereby resulting 
in a larger MLE estimate of the mean LOS, while the robust estimate provides a 
better overall fit and a mean LOS that better describes the majority of the patients. 
Figure 1.8 shows a gamma Q-Q plot based on the robustly fitted gamma distribution. 
This plot reveals that the bulk of the data is well fitted by the robust method, while 
approximately 30 of the largest values of LOS appear to come from a sub-population 
of the patients characterized by longer LOS values. This is best modeled separately 
using another distribution, possibly another gamma distribution with different values 
of the parameters $\alpha$ and $\sigma$.

Figure 1.8 Fitted gamma Q-Q plot of LOS data


1.7 Problems


1.1. Show that if a value $x_0$ is added to a sample $\mathbf{x} = \{x_1, \ldots, x_n\}$,
     when $x_0$ ranges from $-\infty$ to $+\infty$, the standard deviation of the enlarged
     sample ranges between a value smaller than $\mathrm{SD}(\mathbf{x})$ and infinity.

1.2. Consider the situation of the former problem.

     (a) Show that if $n$ is even, the maximum change in the sample median when
         $x_0$ ranges from $-\infty$ to $+\infty$ is the distance from
         $\mathrm{Med}(\mathbf{x})$ to the next order statistic the farthest from
         $\mathrm{Med}(\mathbf{x})$.

     (b) What is the maximum change if $n$ is odd?

1.3. Show for $t_i$ defined in (1.3) that $|t_i| \le (n-1)/\sqrt{n}$ for all possible
     datasets of size $n$, and hence for all datasets $|t_i| < 3$ if $n \le 10$.

1.4. The interquartile range (IQR) is defined as the difference between the third and
     the first quartiles.

     (a) Calculate the IQR of the $N(\mu, \sigma^2)$ distribution.

     (b) Consider the sample interquartile range

         $$\mathrm{IQR}(\mathbf{x}) = \mathrm{IQR}(x_1, x_2, \ldots, x_n) = x_{([3n/4])} - x_{([n/4])}$$

         as a measure of dispersion. It is known that sample quantiles tend to the
         respective distribution quantiles if these are unique. Based on this fact,
         determine the constant $c$ such that the normalized interquartile range
         $\mathrm{IQRN}(\mathbf{x}) = \mathrm{IQR}(\mathbf{x})/c$ is a consistent estimate of
         $\sigma$ when the data has a $N(\mu, \sigma^2)$ distribution.

     (c) Can you think of a reason why you would prefer $\mathrm{MADN}(\mathbf{x})$ to
         $\mathrm{IQRN}(\mathbf{x})$ as a robust estimate of dispersion?

1.5. Show that the median of the exponential distribution is $\lambda \log 2$, and hence
     $\mathrm{Med}(\mathbf{x})/\log 2$ is a consistent estimate of $\lambda$.


Location and Scale


2.1 The location model 


For a systematic treatment of the situations considered in Chapter 1, we need to
represent them by probability-based statistical models. We assume that the outcome
$x_i$ of each observation depends on the “true value” $\mu$ of the unknown parameter (in
Example 1.1, the copper content of the whole flour batch) and also on some random
error process. The simplest assumption is that the error acts additively:

$$x_i = \mu + u_i \quad (i = 1, \ldots, n) \qquad (2.1)$$

where the errors $u_1, \ldots, u_n$ are random variables. This is called the location model.
If the observations are independent replications of the same experiment under
equal conditions, it may be assumed that

• $u_1, \ldots, u_n$ have the same distribution function $F_0$
• $u_1, \ldots, u_n$ are independent.

It follows that $x_1, \ldots, x_n$ are independent, with common distribution function

$$F(x) = F_0(x - \mu) \qquad (2.2)$$

and we say that the $x_i$ are i.i.d. (independent and identically distributed) random
variables.
The assumption that there are no systematic errors can be formalized as follows:

• $u_i$ and $-u_i$ have the same distribution, and consequently $F_0(x) = 1 - F_0(-x)$.


Robust Statistics: Theory and Methods (with R), Second Edition. 

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matias Salibidn-Barrera. 
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd. 
Companion website: www.wiley.com/go/maronna/robust

An estimator $\hat\mu$ is a function of the observations:
$\hat\mu = \hat\mu(x_1, \ldots, x_n) = \hat\mu(\mathbf{x})$ (in some cases, the numeric
value of an estimator for a particular sample will be called an estimate). We are
looking for estimators such that in some sense $\hat\mu \approx \mu$ with high
probability.

One way to measure the approximation is with the mean squared error (MSE):

$$\mathrm{MSE}(\hat\mu) = \mathrm{E}(\hat\mu - \mu)^2 \qquad (2.3)$$

(other measures will be developed later). The MSE can be decomposed as

$$\mathrm{MSE}(\hat\mu) = \mathrm{Var}(\hat\mu) + \mathrm{Bias}(\hat\mu)^2,$$

with

$$\mathrm{Bias}(\hat\mu) = \mathrm{E}\hat\mu - \mu,$$

where “E” stands for the expectation. Note that if $\hat\mu$ is the sample mean and $c$
is any constant, then

$$\hat\mu(x_1 + c, \ldots, x_n + c) = \hat\mu(x_1, \ldots, x_n) + c \qquad (2.4)$$

and

$$\hat\mu(cx_1, \ldots, cx_n) = c\,\hat\mu(x_1, \ldots, x_n). \qquad (2.5)$$

The same holds for the median. These properties are called respectively shift (or
location) and scale equivariance of $\hat\mu$. They imply that, for instance, if we
express our data in degrees Celsius instead of Fahrenheit, the estimator will
automatically adapt to the change of units.
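Properties (2.4) and (2.5) can be checked numerically for both the mean and the median. The sketch below uses the flour data of Example 1.1; the constants 32 and 1.8 are arbitrary illustrative choices, and `is_equivariant` is a hypothetical helper name. Comparisons use a tolerance because of floating-point rounding.

```python
from math import isclose
from statistics import mean, median

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
         3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
         3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

def is_equivariant(estimator, data, shift, scale):
    """Check (2.4) and (2.5) numerically for one estimator and one dataset."""
    base = estimator(data)
    shift_ok = isclose(estimator([v + shift for v in data]), base + shift, rel_tol=1e-9)
    scale_ok = isclose(estimator([scale * v for v in data]), scale * base, rel_tol=1e-9)
    return shift_ok and scale_ok

for name, est in (("mean", mean), ("median", median)):
    print(f"{name}: shift and scale equivariant on this sample ->",
          is_equivariant(est, flour, shift=32.0, scale=1.8))
```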

A traditional way to represent “well-behaved” data (data without outliers) is to
assume $F_0$ is normal with mean 0 and unknown variance $\sigma^2$, which implies

$$F = \mathcal{D}(x_i) = N(\mu, \sigma^2),$$

where $\mathcal{D}(x)$ denotes the distribution of the random variable $x$, and
$N(\mu, v)$ is the normal distribution with mean $\mu$ and variance $v$. Classical
methods assume that $F$ belongs to an exactly known parametric family of distributions.
If the data were exactly normal, the mean would be an “optimal” estimator: it is the
maximum likelihood estimator (MLE) (see next section), and minimizes the MSE among
unbiased estimators, and also among equivariant ones (Bickel and Doksum, 2001; Lehmann
and Casella, 1998). But data are seldom so well behaved.

Figure 2.1 shows the normal Q-Q plot of the observations in Example 1.1. We
see that the bulk of the data may be described by a normal distribution, but not the
whole of it. The same feature can be observed in the Q-Q plot of Figure 1.2. In this
sense, we may speak of $F$ as being only approximately normal, with normality failing
at the tails. We may thus state our initial goal as: looking for estimators that are
almost as good as the mean when $F$ is exactly normal, but that are also “good” in some
sense when $F$ is only approximately normal.

Figure 2.1 Q-Q plot of the flour data


At this point it may seem natural to think that an adequate procedure could be to 
test the hypothesis that the data are normal; if it is not rejected, we use the mean, oth- 
erwise, the median; or, better still, fit a distribution to the data, and then use the MLE 
for the fitted one. But this has the drawback that very large sample sizes are needed 
to distinguish the true distribution, especially since here it is the tails — precisely the 
regions with fewer data — that are most influential. 


2.2 Formalizing departures from normality 


To formalize the idea of approximate normality, we may imagine that a proportion
$1 - \varepsilon$ of the observations is generated by the normal model, while a
proportion $\varepsilon$ is generated by an unknown mechanism. For instance, repeated
measurements are made of something. These measurements are correct 95% of the time, but
5% of the time the apparatus fails or the experimenter makes an incorrect transcription.
This may be described by supposing that:

$$F = (1 - \varepsilon)G + \varepsilon H, \qquad (2.6)$$

where $G = N(\mu, \sigma^2)$ and $H$ may be any distribution; for instance, another normal
with a larger variance and a possibly different mean. This is called a contaminated
normal distribution. This model of contamination is called the Tukey-Huber model,
after Tukey (1960), who gave an early example of the use of these distributions to
show the dramatic lack of robustness of the SD, and Huber (1964), who derived the
first optimality results under this model. In general, $F$ is called a mixture of $G$
and $H$, and is called a normal mixture when both $G$ and $H$ are normal.

To justify (2.6), let $A$ be the event “the apparatus fails”, which has
$P(A) = \varepsilon$, and $A'$ its complement. We are assuming that our observation $x$
has distribution $G$ conditional on $A'$ and $H$ conditional on $A$. Then by the total
probability rule:

$$F(t) = P(x \le t) = P(x \le t \mid A')P(A') + P(x \le t \mid A)P(A) = G(t)(1 - \varepsilon) + H(t)\varepsilon.$$

If $G$ and $H$ have densities $g$ and $h$, respectively, then $F$ has density

$$f = (1 - \varepsilon)g + \varepsilon h. \qquad (2.7)$$
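The mixture formulas (2.6) and (2.7) translate directly into code. The sketch below takes $G = N(0, 1)$, $H = N(0, 100)$ and $\varepsilon = 0.10$, the contaminated normal used for Figure 2.2, and compares the probability of falling more than three units from the center under $G$ and under the mixture $F$. The function names are illustrative.

```python
from statistics import NormalDist

eps = 0.10
G = NormalDist(0.0, 1.0)    # central model N(0, 1)
H = NormalDist(0.0, 10.0)   # contaminating distribution N(0, 100), i.e. SD = 10

def mixture_cdf(t):
    """F(t) = (1 - eps) G(t) + eps H(t), as in (2.6)."""
    return (1.0 - eps) * G.cdf(t) + eps * H.cdf(t)

def mixture_pdf(t):
    """f(t) = (1 - eps) g(t) + eps h(t), as in (2.7)."""
    return (1.0 - eps) * G.pdf(t) + eps * H.pdf(t)

tail_G = 2.0 * (1.0 - G.cdf(3.0))
tail_F = 1.0 - (mixture_cdf(3.0) - mixture_cdf(-3.0))

print(f"P(|x| > 3) under N(0, 1):    {tail_G:.4f}")
print(f"P(|x| > 3) under the mixture: {tail_F:.4f}")
```

Even with only 10% contamination, values beyond three units are far more frequent under the mixture than under the central normal model.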


It must be emphasized that, as in the ozone layer example of Section 1.1,
atypical values are not necessarily due to erroneous measurements: they simply reflect
an unknown change in the measurement conditions in the case of physical measurements,
or more generally the behavior of a sub-population of the data. An important
example of the latter is that normal mixture distributions have been found to often
provide quite useful models for stock market returns; that is, the relative change in
price from one time period to the next, with the mixture components corresponding
to different volatility regimes of the returns.

Another model for outliers is provided by so-called heavy-tailed or fat-tailed distributions,
where the density tails tend to zero more slowly than the normal density tails. An
example is the so-called Cauchy distribution, with density

$$f(x) = \frac{1}{\pi(1+x^2)}. \tag{2.8}$$
It is bell shaped like the normal, but its mean does not exist. It is a particular case of
the family of Student (or t) densities with $\nu > 0$ degrees of freedom, given by

$$f_\nu(x) = c_\nu\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}, \tag{2.9}$$

where $c_\nu$ is a constant:

$$c_\nu = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)},$$

where $\Gamma$ is the gamma function. This family contains all degrees of heavy-tailedness.
When $\nu \to \infty$, $f_\nu$ tends to the standard normal density; for $\nu = 1$ we have the Cauchy
distribution.

Figure 2.2 shows the densities of $N(0,1)$, the Student distribution with four
degrees of freedom, and the contaminated distribution (2.7) with $g = N(0,1)$,
$h = N(0,100)$ and $\varepsilon = 0.10$, denoted by N, T4 and CN respectively. To make comparisons
clearer, the three distributions are normalized to have the same interquartile
range.

Figure 2.2 Standard normal (N), Student (T4), and contaminated normal (CN) densities,
scaled to equal interquartile range


If $F_0 = N(0, \sigma^2)$ in (2.2), then $\bar{x}$ is $N(\mu, \sigma^2/n)$. As we shall see later, the sample
median is approximately $N(\mu, 1.57\sigma^2/n)$, so the sample median has a 57% increase
in variance relative to the sample mean. We say that the median has a low efficiency
at the normal distribution.

On the other hand, assume that 95% of our observations are well-behaved, represented
by $G = N(\mu, 1)$, but that 5% of the time the measuring system gives an
erratic result, represented by a normal distribution with the same mean but a 10-fold
increase in the standard deviation. We thus have the model (2.6) with $\varepsilon = 0.05$ and
$H = N(\mu, 100)$. In general, under the model

$$F = (1-\varepsilon)N(\mu, 1) + \varepsilon N(\mu, \tau^2) \tag{2.10}$$

we have (see (2.88), (2.27) and Problem 2.3)

$$\mathrm{Var}(\bar{x}) = \frac{(1-\varepsilon) + \varepsilon\tau^2}{n}, \qquad \mathrm{Var}(\mathrm{Med}(x)) \approx \frac{\pi}{2n(1-\varepsilon+\varepsilon/\tau)^2}. \tag{2.11}$$

Note that Var(Med(x)) above means “the theoretical variance of the sample median of
x”. It follows that for $\varepsilon = 0.05$ and $H = N(\mu, 100)$, $n$ times the variance of $\bar{x}$ increases
from 1 to 5.95, while that of the median is only 1.72. The gain in robustness from using
the median is paid for by an increase in variance (“a loss in efficiency”) at the normal
distribution.
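
As a quick numerical check of (2.11), the following Python sketch evaluates $n$ times both variances under model (2.10). It assumes that (2.11) reads $n\mathrm{Var}(\bar{x}) = (1-\varepsilon)+\varepsilon\tau^2$ and $n\mathrm{Var}(\mathrm{Med}(x)) \approx \pi/(2(1-\varepsilon+\varepsilon/\tau)^2)$, which agrees with the values 5.95 and 1.72 quoted above; the function names are ours and are not part of the book's R scripts.

```python
import math

def n_var_mean(eps, tau):
    """n * Var(sample mean) under F = (1-eps) N(mu,1) + eps N(mu,tau^2)."""
    return (1.0 - eps) + eps * tau**2

def n_var_median(eps, tau):
    """Large-n approximation of n * Var(sample median) under the same F."""
    return math.pi / (2.0 * (1.0 - eps + eps / tau) ** 2)

# One row per tau: mean and median for eps = 0.05, then for eps = 0.10.
for tau in (3, 4, 5, 6, 10, 20):
    row = [f"{f(eps, tau):6.2f}"
           for eps in (0.05, 0.10)
           for f in (n_var_mean, n_var_median)]
    print(f"{tau:3d}", *row)
```

The printed rows can be compared with Table 2.1: the variance of the mean grows like $\tau^2$, while that of the median stays bounded.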

Table 2.1 shows the approximations for large $n$ of $n$ times the variances of the
mean and median for different values of $\tau$. It is seen that the former increases rapidly
with $\tau$, while the latter stabilizes.

Table 2.1 Variances (×n) of sample mean and median for large n

             ε = 0.05                   ε = 0.10
  τ     nVar(mean)  nVar(Med)     nVar(mean)  nVar(Med)
  3        1.40       1.68           1.80       1.80
  4        1.75       1.70           2.50       1.84
  5        2.20       1.70           3.40       1.86
  6        2.75       1.71           4.50       1.87
 10        5.95       1.72          10.90       1.90
 20       20.9        1.73          40.90       1.92


In the next sections we shall develop estimators that combine the low variance
of the mean at the normal with the robustness of the median under contamination.
For introductory purposes we will deal only with symmetric distributions.
The distribution of the variable $x$ is symmetric about $\mu$ if $x-\mu$ and $\mu-x$ have
the same distribution. If $x$ has a density $f$, symmetry about $\mu$ is equivalent to
$f(\mu+x) = f(\mu-x)$. Symmetry implies that $\mathrm{Med}(x) = \mu$, and if the expectation
exists, also that $\mathrm{E}x = \mu$. Therefore, if the data have a symmetric distribution, there is
no bias and only the variability is at issue. In Chapter 3, general contamination will
be addressed.

Two early and somewhat primitive ways to obtain robust estimators were based
on deleting and truncating atypical data. Assume that we define an interval $[a, b]$
(depending on the data) containing supposedly “typical” observations, such as
$a = \bar{x} - 2s$, $b = \bar{x} + 2s$. Deletion means using a modified sample, obtained by omitting all
points outside $[a, b]$. Truncation means replacing all $x_i < a$ by $a$ and all $x_i > b$ by $b$,
and not altering the other points. In other words, atypical values are swapped for the
nearest typical ones. Naive uses of these ideas are not necessarily good, but some of
the methods we shall study are elaborate versions of them.
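
The two operations can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the data are made up, the interval is the naive choice of the sample mean plus or minus two sample standard deviations, and the function names are hypothetical.

```python
import numpy as np

def delete_outside(x, a, b):
    """Deletion: keep only the points inside [a, b]."""
    x = np.asarray(x, dtype=float)
    return x[(x >= a) & (x <= b)]

def truncate_to(x, a, b):
    """Truncation: move points below a up to a and points above b down to b."""
    return np.clip(np.asarray(x, dtype=float), a, b)

# Made-up sample with one gross outlier.
x = np.array([2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.0, 2.7, 2.4, 9.0])
s = x.std(ddof=1)
a, b = x.mean() - 2 * s, x.mean() + 2 * s

print(delete_outside(x, a, b).mean())  # mean of the 9 remaining points
print(truncate_to(x, a, b).mean())     # outlier replaced by b, so still pulled upward
```

Note that the outlier itself inflates both the mean and $s$, so the interval it is judged against is already distorted; this is one reason why naive uses of these ideas may fail.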


2.3 M-estimators of location


We shall now develop a general family of estimators that contains the mean and the 
median as special cases. 


2.3.1 Generalizing maximum likelihood 


Consider again the location model (2.1). Assume that $F_0$, the distribution function
of $u_i$, has a density $f_0 = F_0'$. The joint density of the observations (the likelihood
function) is

$$L(x_1,\ldots,x_n;\mu) = \prod_{i=1}^n f_0(x_i-\mu).$$

The maximum likelihood estimator (MLE) of $\mu$ is the value $\hat\mu$, depending on
$x_1,\ldots,x_n$, that maximizes $L(x_1,\ldots,x_n;\mu)$:

$$\hat\mu = \hat\mu(x_1,\ldots,x_n) = \arg\max_\mu L(x_1,\ldots,x_n;\mu), \tag{2.12}$$

where “arg max” stands for “the value maximizing”.

If we knew $F_0$ exactly, the MLE would be “optimal” in the sense of attaining the
lowest possible asymptotic variance among a “reasonable” class of estimators (see
Section 10.8). But since we know $F_0$ only approximately, our goal will be to find
estimators that are “nearly optimal” for both of the following situations:


(A) when $F_0$ is exactly normal
(B) when $F_0$ is approximately normal (say, contaminated normal).


If $f_0$ is everywhere positive, since the logarithm is an increasing function, (2.12)
can be written as

$$\hat\mu = \arg\min_\mu \sum_{i=1}^n \rho(x_i-\mu) \tag{2.13}$$

where

$$\rho = -\log f_0. \tag{2.14}$$

If $F_0 = N(0, 1)$, then

$$f_0(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \tag{2.15}$$

and apart from a constant, $\rho(x) = x^2/2$. Hence (2.13) is equivalent to

$$\hat\mu = \arg\min_\mu \sum_{i=1}^n (x_i-\mu)^2. \tag{2.16}$$

If $F_0$ is the double exponential distribution

$$f_0(x) = \frac{1}{2}e^{-|x|} \tag{2.17}$$

then $\rho(x) = |x|$, and (2.13) is equivalent to

$$\hat\mu = \arg\min_\mu \sum_{i=1}^n |x_i-\mu|. \tag{2.18}$$

We shall see below that the solutions to (2.16) and (2.18) are the sample mean and
median, respectively.
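
Before the formal argument, the claim can be checked numerically. The sketch below, which uses a made-up sample and a brute-force grid search of our own choosing, minimizes the criteria in (2.16) and (2.18) and compares the minimizers with the sample mean and median.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 30.0])        # made-up sample with one large value
grid = np.linspace(x.min(), x.max(), 29001)     # candidate values of mu, step 0.001

# Criterion values for every candidate mu (rows: observations, columns: candidates).
sum_sq = ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)    # criterion in (2.16)
sum_abs = np.abs(x[:, None] - grid[None, :]).sum(axis=0)    # criterion in (2.18)

mu_l2 = grid[sum_sq.argmin()]
mu_l1 = grid[sum_abs.argmin()]
print(mu_l2, x.mean())       # minimizer of the squares vs. sample mean
print(mu_l1, np.median(x))   # minimizer of the absolute values vs. sample median
```

The large observation drags the least-squares solution upward, while the least-absolute-values solution stays with the bulk of the data.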
If $\rho$ is differentiable, differentiating (2.13) with respect to $\mu$ yields

$$\sum_{i=1}^n \psi(x_i-\hat\mu) = 0 \tag{2.19}$$

with $\psi = \rho'$. If $\psi$ is discontinuous, solutions to (2.19) might not exist, and in this case
we shall interpret (2.19) to mean that the left-hand side changes sign at $\hat\mu$. Note that
if $f_0$ is symmetric, then $\rho$ is even and hence $\psi$ is odd.

If $\rho(x) = x^2/2$, then $\psi(x) = x$, and (2.19) becomes

$$\sum_{i=1}^n (x_i-\hat\mu) = 0,$$

which has $\hat\mu = \bar{x}$ as solution.

For $\rho(x) = |x|$, it will be shown that any median of $x$ is a solution of (2.18). In fact,
the derivative of $\rho(x)$ exists for $x \ne 0$, and is given by the sign function: $\psi(x) = \mathrm{sgn}(x)$,
where

$$\mathrm{sgn}(x) = \begin{cases} -1 & \text{if } x<0 \\ 0 & \text{if } x=0 \\ 1 & \text{if } x>0. \end{cases} \tag{2.20}$$


Since the function to be minimized in (2.18) is continuous, it suffices to find the
values of $\mu$ where its derivative changes sign. Note that

$$\mathrm{sgn}(x) = I(x > 0) - I(x < 0) \tag{2.21}$$

where $I(\cdot)$ stands for the indicator function; that is,

$$I(A) = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{if } A \text{ is false.} \end{cases}$$

Applying (2.21) to (2.19) yields

$$\sum_{i=1}^n \mathrm{sgn}(x_i-\mu) = \sum_{i=1}^n I(x_i-\mu>0) - \sum_{i=1}^n I(x_i-\mu<0) = \#(x_i>\mu) - \#(x_i<\mu) = 0$$

and hence $\#(x_i > \mu) = \#(x_i < \mu)$, which implies that $\mu$ is any sample median.
From now on, the average of a dataset $z = \{z_1,\ldots,z_n\}$ will be denoted by ave(z),
or by $\mathrm{ave}_i(z_i)$ when necessary; that is,

$$\mathrm{ave}(z) = \mathrm{ave}_i(z_i) = \frac{1}{n}\sum_{i=1}^n z_i,$$

and its median by Med(z) or $\mathrm{Med}_i(z_i)$. If $c$ is a constant, $z + c$ and $cz$ will denote
the data sets $(z_1+c,\ldots,z_n+c)$ and $(cz_1,\ldots,cz_n)$. If $x$ is a random variable with
distribution $F$, the mean and median of a function $g(x)$ will be denoted by $E_F g(x)$ and
$\mathrm{Med}_F g(x)$, dropping the subscript $F$ when there is no ambiguity.

Given a function $\rho$, an M-estimator of location is a solution of (2.13). We shall
henceforth study estimators of this form, which need not be MLEs for any distribution.
The function $\rho$ will be chosen in order to ensure goals (A) and (B) above.

Assume $\psi$ is monotone nondecreasing, with $\psi(-\infty) < 0 < \psi(\infty)$. Then it can
be proved (see Theorem 10.1) that (2.19), and hence (2.13), always has a solution.
If $\psi$ is continuous and increasing, the solution is unique; otherwise the set of
solutions is either a point or an interval (throughout this book, we shall call any function
$g$ increasing (nondecreasing) if $a < b$ implies $g(a) < g(b)$ ($g(a) \le g(b)$)). More
details on uniqueness are given in Section 10.1.

It is easy to show that M-estimators are shift equivariant, as defined in (2.4) 
(Problem 2.5). The mean and median are scale equivariant, but this does not hold 
in general for M-estimators in their present form. This drawback will be overcome in 
Section 2.7. 


2.3.2 The distribution of M-estimators 


In order to evaluate the performance of M-estimators, it is necessary to calculate 
their distributions. Except for the mean and the median (see (10.60)), there are no 
explicit expressions for the distribution of M-estimators in finite sample sizes, but 
approximations can be found and a heuristic derivation is given in Section 2.10.2 
(a rigorous treatment is given in Section 10.3). 
Assume $\psi$ is increasing. For a given distribution $F$, define $\mu_0 = \mu_0(F)$ as the solution of

$$E_F\psi(x-\mu_0) = 0. \tag{2.22}$$

For the sample mean, $\psi(x) = x$, and (2.22) implies $\mu_0 = \mathrm{E}x$; that is, the population
mean. For the sample median, (2.21) and (2.22) yield, for continuous $F$,

$$P(x > \mu_0) - P(x < \mu_0) = 1 - 2F(\mu_0) = 0$$

which implies $F(\mu_0) = 1/2$, which corresponds to $\mu_0 = \mathrm{Med}(x)$; that is, a population
median. In general if $F$ is symmetric, then $\mu_0$ coincides with the center of symmetry
(Problem 2.6).

It can be shown (see Section 2.10.2) that when $n \to \infty$,

$$\hat\mu \to_p \mu_0 \tag{2.23}$$

where “$\to_p$” stands for “tends in probability” and $\mu_0$ is defined in (2.22) (we say that
$\hat\mu$ is “consistent for $\mu_0$”), and the distribution of $\hat\mu$ is approximately

$$N\left(\mu_0, \frac{v}{n}\right) \quad\text{with}\quad v = \frac{E_F(\psi(x-\mu_0)^2)}{(E_F\psi'(x-\mu_0))^2}. \tag{2.24}$$

Note that under model (2.2) $v$ does not depend on $\mu_0$; that is,

$$v = \frac{E_{F_0}(\psi(x)^2)}{(E_{F_0}\psi'(x))^2}. \tag{2.25}$$

If the distribution of an estimator $\hat\mu$ is approximately $N(\mu_0, v/n)$ for large $n$, we
say that $\hat\mu$ is asymptotically normal, with asymptotic value $\mu_0$ and asymptotic variance
$v$. The asymptotic efficiency of $\hat\mu$ is the ratio

$$\mathrm{Eff}(\hat\mu) = \frac{v_0}{v}, \tag{2.26}$$

where $v_0$ is the asymptotic variance of the MLE, and measures how near $\hat\mu$ is to the
optimum. The expression for $v$ in (2.24) is called the asymptotic variance of $\hat\mu$.

To understand the meaning of efficiency, consider two estimators with asymptotic
variances $v_1$ and $v_2$. Since their distributions are approximately normal with variances
$v_1/n$ and $v_2/n$, if for example $v_1 = 3v_2$ then the first estimator requires three times
as many observations to attain the same variance as the second.

For the sample mean, $\psi' \equiv 1$ and hence $v = \mathrm{Var}(x)$. For the sample median, the
numerator of $v$ is one. Here $\psi'$ does not exist, but if $x$ has a density $f$, it is shown in
Section 10.3 that $E\psi'$ in the denominator is replaced by $2f(\mu_0)$, and hence

$$v = \frac{1}{4f(\mu_0)^2}. \tag{2.27}$$

Thus for $F = N(0, 1)$ we have

$$v = \frac{2\pi}{4} = 1.571.$$


It will be seen that a type of $\rho$- and $\psi$-functions with important properties is the
family of Huber functions, plotted in Figure 2.3:

$$\rho_k(x) = \begin{cases} x^2 & \text{if } |x| \le k \\ 2k|x| - k^2 & \text{if } |x| > k \end{cases} \tag{2.28}$$

with derivative $2\psi_k(x)$, where

$$\psi_k(x) = \begin{cases} x & \text{if } |x| \le k \\ \mathrm{sgn}(x)\,k & \text{if } |x| > k. \end{cases} \tag{2.29}$$

It is seen that $\rho_k$ is quadratic in a central region, but increases only linearly to
infinity. The M-estimators corresponding to the limit cases $k \to \infty$ and $k \to 0$ are the
mean and the median, and we define $\psi_0(x)$ as $\mathrm{sgn}(x)$.

The value of $k$ is chosen in order to ensure a given asymptotic variance, and hence a
given asymptotic efficiency, at the normal distribution. Table 2.2 gives the asymptotic
variances of the estimator at model (2.6) with $G = N(0, 1)$ and $H = N(0, 100)$ (that is,
standard deviation 10), for different values of $k$.

Here we see the trade-off between robustness and efficiency: when $k = 1.4$, the
variance of the M-estimator at the normal is only 4.7% larger than that of $\bar{x}$ (which
corresponds to $k = \infty$) and much smaller than that of the median (which corresponds
to $k = 0$), while for contaminated normals it is clearly smaller than both.


Figure 2.3 Huber $\rho$- and $\psi$-functions (panels: Rho and Psi, plotted against x)


Table 2.2 Asymptotic variances of Huber M-estimator

   k      ε = 0    ε = 0.05   ε = 0.10
   0      1.571     1.722      1.897
   0.7    1.187     1.332      1.501
   1.0    1.107     1.263      1.443
   1.4    1.047     1.227      1.439
   1.7    1.023     1.233      1.479
   2.0    1.010     1.259      1.550
   ∞      1.000     5.950     10.900


Huber’s $\psi$ is one of the few cases where the asymptotic variance at the normal
distribution can be calculated analytically. Since $\psi_k'(x) = I(|x| \le k)$, the denominator
of (2.24) is $(\Phi(k) - \Phi(-k))^2$. The reader can verify that the numerator is

$$E_\Phi\psi_k(x)^2 = 2\left[k^2(1-\Phi(k)) + \Phi(k) - 0.5 - k\varphi(k)\right] \tag{2.30}$$

where $\varphi$ and $\Phi$ are the standard normal density and distribution function, respectively
(Problem 2.7). In Table 2.3 we give the values of $k$ yielding prescribed asymptotic
variances $v$ for the standard normal. The last row gives values of the quantity
$\alpha = 1 - \Phi(k)$, which will play a role in Section 2.4.

Table 2.3 Asymptotic variances for Huber’s psi-function

   k    0.66    1.03    1.37
   v    1.20    1.10    1.05
   α    0.25    0.15    0.085
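
The closed form (2.30) makes it easy to evaluate the asymptotic variance numerically. The Python sketch below is our own illustration, not the book's code. It assumes the symmetric contaminated model $F = (1-\varepsilon)N(0,1) + \varepsilon N(0,\tau^2)$, for which $\mu_0 = 0$, and uses the rescaling identity $\psi_k(\tau z) = \tau\psi_{k/\tau}(z)$ to reuse (2.30) for the contaminating component.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def e_psi2(k):
    """E psi_k(x)^2 for x ~ N(0,1), formula (2.30)."""
    return 2.0 * (k * k * (1.0 - Phi(k)) + Phi(k) - 0.5 - k * phi(k))

def huber_avar(k, eps=0.0, tau=1.0):
    """Asymptotic variance (2.24) at F = (1-eps) N(0,1) + eps N(0,tau^2), for k > 0."""
    # For x ~ N(0, tau^2): E psi_k(x)^2 = tau^2 * E psi_{k/tau}(z)^2 and
    # E psi_k'(x) = P(|x| <= k) = 2 Phi(k/tau) - 1, with z ~ N(0,1).
    num = (1.0 - eps) * e_psi2(k) + eps * tau**2 * e_psi2(k / tau)
    den = (1.0 - eps) * (2.0 * Phi(k) - 1.0) + eps * (2.0 * Phi(k / tau) - 1.0)
    return num / den**2

for k in (0.66, 1.03, 1.37):
    print(k, round(huber_avar(k), 3), round(1.0 - Phi(k), 3))   # compare with Table 2.3
print(round(huber_avar(1.4, eps=0.05, tau=10.0), 3))            # compare with Table 2.2
```

With $\varepsilon = 0$ the function gives the variances at the standard normal; with $\tau = 10$ it can be compared with the entries of Table 2.2 for $k > 0$.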


2.3.3 An intuitive view of M-estimators 


A location M-estimator can be seen as a weighted mean. In most cases of interest,
$\psi(0) = 0$ and $\psi'(0)$ exists, so that $\psi$ is approximately linear at the origin. Let

$$W(x) = \begin{cases} \psi(x)/x & \text{if } x \ne 0 \\ \psi'(0) & \text{if } x = 0. \end{cases} \tag{2.31}$$

Then (2.19) can be written as

$$\sum_{i=1}^n W(x_i-\hat\mu)(x_i-\hat\mu) = 0,$$

or equivalently

$$\hat\mu = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \quad\text{with}\quad w_i = W(x_i-\hat\mu), \tag{2.32}$$

which expresses the estimator as a weighted mean. Since in general $W(x)$ is a nonincreasing
function of $|x|$, outlying observations will receive smaller weights. Note
that although (2.32) looks like an explicit expression for $\hat\mu$, actually the weights on
the right-hand side depend also on $\hat\mu$. Besides its intuitive value, this representation
of the estimator will be useful for its numeric computation in Section 2.8. The weight
function corresponding to Huber’s $\psi$ is

$$W_k(x) = \min\left\{1, \frac{k}{|x|}\right\}, \tag{2.33}$$

which is plotted in the upper panel of Figure 2.4.
Another intuitive way to interpret an M-estimator is to rewrite (2.19) as

$$\hat\mu = \hat\mu + \frac{1}{n}\sum_{i=1}^n \psi(x_i-\hat\mu) = \frac{1}{n}\sum_{i=1}^n \zeta(x_i, \hat\mu) \tag{2.34}$$

where

$$\zeta(x, \mu) = \mu + \psi(x-\mu), \tag{2.35}$$

which for the Huber function takes the form

$$\zeta(x, \mu) = \begin{cases} \mu - k & \text{if } x < \mu - k \\ x & \text{if } \mu - k \le x \le \mu + k \\ \mu + k & \text{if } x > \mu + k. \end{cases} \tag{2.36}$$

In other words, $\hat\mu$ may be viewed as an average of the modified observations
$\zeta(x_i, \hat\mu)$ (called “pseudo-observations”): observations in the bulk of the data remain
unchanged, while those too large or too small are truncated as described at the end
of Section 2.2 (note that here the truncation interval depends on the data).

Figure 2.4 Huber and bisquare weight functions
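
Both representations can be illustrated with a short Python sketch. It iterates the weighted mean (2.32) with the Huber weights (2.33), starting from the median, and then checks that the result equals the average of the pseudo-observations (2.35). The data are made up, $k = 1.37$ is taken from Table 2.3, the data are used on their original scale (no dispersion estimator is involved at this stage), and the function names are ours; the book's own algorithm is the one given in Section 2.8.

```python
import numpy as np

def huber_psi(r, k):
    """Huber psi: the identity on [-k, k], constant outside."""
    return np.clip(r, -k, k)

def huber_weight(r, k):
    """W_k(r) = min{1, k/|r|} as in (2.33), written without division by zero."""
    return k / np.maximum(np.abs(r), k)

def huber_location(x, k=1.37, tol=1e-10, max_iter=200):
    """Fixed-point iteration on the weighted-mean representation (2.32)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                           # robust starting value
    for _ in range(max_iter):
        w = huber_weight(x - mu, k)
        mu_new = np.sum(w * x) / np.sum(w)      # weighted mean with current weights
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

x = np.array([2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.0, 2.7, 2.4, 9.0])  # made-up data
mu_hat = huber_location(x)
pseudo = mu_hat + huber_psi(x - mu_hat, 1.37)   # pseudo-observations (2.35)
print(mu_hat, pseudo.mean(), x.mean())
```

In this sample only the value 9.0 is truncated (to $\hat\mu + k$), so the estimate stays close to the bulk of the data while the sample mean does not.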


2.3.4 Redescending M-estimators 


It is easy to show (Problem 2.15) that the MLE for the Student family of densities
(2.9) has the $\psi$-function

$$\psi(x) = \frac{x}{x^2+\nu}, \tag{2.37}$$

which tends to zero when $x \to \infty$. This suggests that for symmetric heavy-tailed distributions,
it is better to use “redescending” $\psi$s that tend to zero at infinity. This
implies that for large $x$, the respective $\rho$-function increases more slowly than Huber’s
$\rho$ (2.28), which is linear for $x > k$.

We will later discuss the advantages of using a bounded $\rho$. A popular choice of
$\rho$- and $\psi$-functions is the bisquare (also called biweight) family of functions:

$$\rho(x) = \begin{cases} 1 - [1-(x/k)^2]^3 & \text{if } |x| \le k \\ 1 & \text{if } |x| > k \end{cases} \tag{2.38}$$

with derivative $\rho'(x) = 6\psi(x)/k^2$ where

$$\psi(x) = x\left[1-\left(\frac{x}{k}\right)^2\right]^2 I(|x| \le k). \tag{2.39}$$

These functions are displayed in Figure 2.5. Note that $\psi$ is everywhere differentiable
and it vanishes outside $[-k, k]$. M-estimators with $\psi$ vanishing outside an interval are
not MLEs for any distribution (Problem 2.12).

The weight function (2.31) for this family is

$$W(x) = \left[1-\left(\frac{x}{k}\right)^2\right]^2 I(|x| \le k)$$

and is plotted in Figure 2.4.

Figure 2.5 $\rho$- and $\psi$-functions for the bisquare estimator

Table 2.4 Values of k for prescribed efficiencies of bisquare estimator

Efficiency    0.80    0.85    0.90    0.95
k             3.14    3.44    3.88    4.68


If $\rho$ is everywhere differentiable and $\psi$ is monotonic, then the forms (2.13) and
(2.19) are equivalent. If $\psi$ is redescending, some solutions of (2.19), usually called
“bad solutions”, may not correspond to the absolute minimum of the criterion, which
defines the M-estimator.

Estimators defined as solutions of (2.19) with monotone $\psi$ will be called
“monotone M-estimators” for short, while those defined by (2.13) when $\psi$ is not
monotone will be called “redescending M-estimators”. Numerical computing of
redescending location estimators is essentially no more difficult than for monotone
estimators (Section 2.8.1). It will be seen in Section 3.4 that redescending estimators
offer an increase in robustness when there are large outliers.
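
For reference, the bisquare functions can be coded directly from (2.38), (2.39) and the weight function above. The following Python sketch is only an illustration (function names are ours); it also checks numerically that $\rho'(x) = 6\psi(x)/k^2$, using the value $k = 4.68$ from Table 2.4.

```python
import numpy as np

def bisquare_rho(x, k):
    """Bisquare rho (2.38): bounded, equal to 1 for |x| >= k."""
    u = np.clip(np.asarray(x, dtype=float) / k, -1.0, 1.0)
    return 1.0 - (1.0 - u**2) ** 3

def bisquare_weight(x, k):
    """Bisquare weight function: zero for |x| >= k."""
    u = np.clip(np.asarray(x, dtype=float) / k, -1.0, 1.0)
    return (1.0 - u**2) ** 2

def bisquare_psi(x, k):
    """Bisquare psi (2.39), written as x * W(x)."""
    x = np.asarray(x, dtype=float)
    return x * bisquare_weight(x, k)

k = 4.68
xs = np.linspace(-6.0, 6.0, 25)
h = 1e-6
numeric = (bisquare_rho(xs + h, k) - bisquare_rho(xs - h, k)) / (2 * h)  # central difference
print(np.max(np.abs(numeric - 6.0 * bisquare_psi(xs, k) / k**2)))        # should be tiny
```

Observations with $|x| \ge k$ get weight exactly zero, which is what makes the estimator redescending.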

The values of $k$ for prescribed efficiencies (2.26) of the bisquare estimator are
given in Table 2.4. If $\rho$ has a nondecreasing derivative, it can be shown (Feller, 1971)
that for all $x$, $y$

$$\rho(\alpha x + (1-\alpha)y) \le \alpha\rho(x) + (1-\alpha)\rho(y) \quad \forall\, \alpha \in [0, 1]. \tag{2.40}$$

Functions verifying (2.40) are referred to as convex.
We state the following definitions for later reference. 


Definition 2.1. Unless stated otherwise, a $\rho$-function will denote a function $\rho$ such
that:


R1 $\rho(x)$ is a nondecreasing function of $|x|$

R2 $\rho(0) = 0$

R3 $\rho(x)$ is increasing for $x > 0$ such that $\rho(x) < \rho(\infty)$

R4 if $\rho$ is bounded, it is also assumed that $\rho(\infty) = 1$.


Definition 2.2. A $\psi$-function will denote a function $\psi$ that is the derivative of a
$\rho$-function, which implies in particular that


Ψ1 $\psi$ is odd and $\psi(x) \ge 0$ for $x \ge 0$.


2.4 Trimmed and Winsorized means 


Another approach to robust estimation of location would be to discard a proportion of
the largest and smallest values. More precisely, let $\alpha \in [0, 1/2)$ and $m = [n\alpha]$ where
$[\cdot]$ stands for the integer part, and define the $\alpha$-trimmed mean as

$$\bar{x}_\alpha = \frac{1}{n-2m}\sum_{i=m+1}^{n-m} x_{(i)},$$

where $x_{(i)}$ denotes the order statistics (1.2).

The reader may think that we are again suppressing observations. Note, however,
that no subjective choice has been made: the result is actually a function of all
observations (even of those that have not been included in the sum).

The limit cases $\alpha = 0$ and $\alpha = 0.5$ correspond to the sample mean and median,
respectively. For the data of Example 1.1, the $\alpha$-trimmed means with $\alpha = 0.10$ and
0.25 are, respectively, 3.20 and 3.27. Deleting the largest observation changes them
to 3.17 and 3.22, respectively.

The exact distribution of trimmed means is intractable. Its large-sample approximation
is more complicated than that of M-estimators, and will be described in
Section 10.7. It can be proved that for large $n$ the distribution under model (2.1) is
approximately normal, and for symmetrically distributed $u_i$, the asymptotic distribution
is $D(\hat\mu) \approx N(\mu, v/n)$, where the asymptotic variance $v$ is that of an M-estimator
with Huber’s function $\psi_k$, where $k$ is the $(1-\alpha)$-quantile of $u$:

$$v = \frac{E[\psi_k(u)^2]}{(1-2\alpha)^2}. \tag{2.41}$$

The values of $\alpha$ yielding prescribed asymptotic variances at the standard normal are
given at the bottom of Table 2.3. Note that the asymptotic efficiency of $\bar{x}_{0.25}$ is 0.83,
even though we seem to be “throwing away” 50% of the observations. Note also that
the asymptotic variance of a trimmed mean is not a trimmed variance. This would be
so if the numerator of (2.41) were

$$E[(x-\mu)^2 I(|x-\mu| \le k)].$$

An idea similar to the trimmed mean is the $\alpha$-Winsorized mean (named after the
biostatistician Charles P. Winsor), defined as

$$\tilde{x}_\alpha = \frac{1}{n}\left(m\,x_{(m)} + m\,x_{(n-m+1)} + \sum_{i=m+1}^{n-m} x_{(i)}\right),$$

where $m$ and the $x_{(i)}$ are as above. That is, extreme values, instead of being deleted as
in the trimmed mean, are shifted towards the bulk of the data.

It can be shown that for large $n$, $\tilde{x}_\alpha$ is approximately normal (Bickel 1965). If
the distribution $F$ of $u_i$ is symmetric and has a density $f$, then $\tilde{x}_\alpha$ is approximately
$N(\mu, v/n)$, with

$$v = 2\alpha\left(u_{1-\alpha} + \frac{\alpha}{f(u_{1-\alpha})}\right)^2 + E\,u^2 I(u_\alpha \le u \le u_{1-\alpha}), \tag{2.42}$$

where $u_{1-\alpha}$ is the $(1-\alpha)$-quantile of $F$.

A more general class of estimators, called L-estimators, is defined as linear combinations
of order statistics:

$$\hat\mu = \sum_{i=1}^n a_i x_{(i)} \tag{2.43}$$

where the $a_i$s are given constants. For $\alpha$-trimmed means,

$$a_i = \frac{1}{n-2m} I(m+1 \le i \le n-m), \tag{2.44}$$

and for $\alpha$-Winsorized means

$$a_i = \frac{1}{n}\left(m\,I(i=m) + m\,I(i=n-m+1) + I(m+1 \le i \le n-m)\right). \tag{2.45}$$

It is easy to show (Problem 2.10) that if the coefficients of an L-estimator satisfy the
conditions

$$a_i \ge 0, \quad \sum_{i=1}^n a_i = 1, \quad a_i = a_{n-i+1}, \tag{2.46}$$

then the estimator is shift and scale equivariant, and also fulfills the natural conditions


C1 If $x_i \ge 0$ for all $i$, then $\hat\mu \ge 0$
C2 If $x_i = c$ for all $i$, then $\hat\mu = c$
C3 $\hat\mu(-x) = -\hat\mu(x)$.
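
The coefficient vectors (2.44) and (2.45) can be built explicitly, which gives a direct way to compute trimmed and Winsorized means as L-estimators. The Python sketch below follows the indexing of the text literally (the Winsorized mean puts weight $m/n$ on $x_{(m)}$ and $x_{(n-m+1)}$; some other references Winsorize to $x_{(m+1)}$ and $x_{(n-m)}$ instead). The data and function names are our own, and `int(n * alpha)` is used for the integer part, which may be off by one when `n * alpha` is not exactly representable in floating point.

```python
import numpy as np

def trimmed_weights(n, alpha):
    """Coefficients (2.44) of the alpha-trimmed mean."""
    m = int(n * alpha)                       # m = [n alpha]
    a = np.zeros(n)
    a[m:n - m] = 1.0 / (n - 2 * m)           # ranks m+1, ..., n-m
    return a

def winsorized_weights(n, alpha):
    """Coefficients (2.45) of the alpha-Winsorized mean, as indexed in the text."""
    m = int(n * alpha)
    a = np.zeros(n)
    a[m:n - m] = 1.0 / n                     # ranks m+1, ..., n-m
    if m > 0:
        a[m - 1] += m / n                    # rank m
        a[n - m] += m / n                    # rank n-m+1
    return a

def l_estimator(x, a):
    """Linear combination of order statistics (2.43)."""
    return float(np.dot(a, np.sort(x)))

x = np.array([2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.0, 2.7, 2.4, 9.0])  # made-up data
print(l_estimator(x, trimmed_weights(len(x), 0.10)))
print(l_estimator(x, winsorized_weights(len(x), 0.20)))
```

Both coefficient vectors are nonnegative, sum to one and are symmetric, so the conditions (2.46) hold.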


2.5 M-estimators of scale 


In this section we discuss a situation that, while not especially important in itself, will
play an important auxiliary role in the development of estimators for location, regression
and multivariate analysis. Consider observations $x_i$ satisfying the multiplicative
model

$$x_i = \sigma u_i \tag{2.47}$$

where the $u_i$ are i.i.d. with density $f_0$ and $\sigma > 0$ is the unknown parameter. The distributions
of the $x_i$ constitute a scale family, with density

$$\frac{1}{\sigma} f_0\left(\frac{x}{\sigma}\right).$$

Examples are the exponential family, with $f_0(x) = \exp(-x)I(x > 0)$, and the normal
scale family $N(0, \sigma^2)$, with $f_0$ given by (2.15).

The MLE of $\sigma$ in (2.47) is

$$\hat\sigma = \arg\max_\sigma \frac{1}{\sigma^n}\prod_{i=1}^n f_0\left(\frac{x_i}{\sigma}\right).$$

Taking logs and differentiating with respect to $\sigma$ yields

$$\frac{1}{n}\sum_{i=1}^n \rho\left(\frac{x_i}{\hat\sigma}\right) = 1 \tag{2.48}$$

where $\rho(t) = t\psi(t)$, with $\psi = -f_0'/f_0$. If $f_0$ is $N(0, 1)$ then $\rho(t) = t^2$, which yields
$\hat\sigma = \sqrt{\mathrm{ave}(x^2)}$ (the root mean square, RMS); if $f_0$ is the double exponential defined in (2.17),
then $\rho(t) = |t|$, which yields $\hat\sigma = \mathrm{ave}(|x|)$. Note that if $f_0$ is even, so is $\rho$, and this
implies that $\hat\sigma$ depends only on the absolute values of the $x_i$.

In general, any estimator satisfying an equation of the form

$$\frac{1}{n}\sum_{i=1}^n \rho\left(\frac{x_i}{\hat\sigma}\right) = \delta \tag{2.49}$$

where $\rho$ is a $\rho$-function and $\delta$ is a positive constant, will be called an M-estimator
of scale. Note that in order for (2.49) to have a solution we must have $0 < \delta < \rho(\infty)$.
Hence if $\rho$ is bounded it will be assumed without loss of generality that

$$\rho(\infty) = 1, \quad \delta \in (0, 1).$$

In the rare event that $\#(x_i = 0) > n(1-\delta)$, (2.49) has
no solution. In this case it is natural to define $\hat\sigma(x) = 0$. It is easy to verify that scale
M-estimators are equivariant, in the sense that $\hat\sigma(cx) = c\hat\sigma(x)$ for any $c > 0$, and if $\rho$
is even then

$$\hat\sigma(cx) = |c|\hat\sigma(x)$$

for any $c$. For large $n$, the sequence of estimators (2.49) converges to the solution of

$$E\rho\left(\frac{x}{\sigma}\right) = \delta \tag{2.50}$$

if it is unique (Section 10.2); see Problem 10.6.
The reader can verify that the scale MLE for the Student distribution is equivalent
to

$$\rho(t) = \frac{t^2}{t^2+\nu} \quad\text{and}\quad \delta = \frac{1}{\nu+1}. \tag{2.51}$$

A frequently used scale estimator is the bisquare scale, where $\rho$ is given by (2.38)
with $k = 1$; that is,

$$\rho(x) = \min\{1-(1-x^2)^3, 1\} \tag{2.52}$$

and $\delta = 0.5$. It is easy to verify that (2.51) and (2.52) satisfy the conditions for a
$\rho$-function in Definition 2.1.

When $\rho$ is the step function

$$\rho(t) = I(|t| > c), \tag{2.53}$$

where $c$ is a positive constant, and $\delta = 0.5$, we have $\hat\sigma = \mathrm{Med}(|x|)/c$. The argument
in Problem 2.12 shows that it is not the scale MLE for any distribution.

Most often we shall use a $\rho$ that is quadratic near the origin (that is, $\rho'(0) = 0$ and
$\rho''(0) > 0$), and in such cases an M-scale estimator can be represented as a weighted
RMS estimator. We define the weight function as

$$W(x) = \begin{cases} \rho(x)/x^2 & \text{if } x \ne 0 \\ \rho''(0)/2 & \text{if } x = 0 \end{cases} \tag{2.54}$$

(the value at zero being the limit of $\rho(x)/x^2$ as $x \to 0$), and then (2.49) is equivalent to

$$\hat\sigma^2 = \frac{1}{n\delta}\sum_{i=1}^n W\left(\frac{x_i}{\hat\sigma}\right)x_i^2. \tag{2.55}$$

It follows that $\hat\sigma$ can be seen as a weighted RMS estimator. For the Student MLE

$$W(x) = \frac{1}{\nu+x^2}, \tag{2.56}$$

and for the bisquare scale

$$W(x) = \min\{3-3x^2+x^4, 1/x^2\}. \tag{2.57}$$

It is seen that larger values of $x$ receive smaller weights.

Note that using $\rho(x/c)$ instead of $\rho(x)$ in (2.49) yields $\hat\sigma/c$ instead of $\hat\sigma$. This
can be used to normalize $\hat\sigma$ to have a given asymptotic value, as will be done at the
end of Section 2.6. If we want $\hat\sigma$ to coincide asymptotically with SD(x) when $x$ is
normal, then (recalling (2.50)) we have to take $c$ as the solution of $E\rho(x/c) = \delta$ with
$x \sim N(0, 1)$, which can be obtained numerically. For the bisquare scale, the solution
is $c = 1.56$.
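
The following Python sketch illustrates both steps for the bisquare scale with $\delta = 0.5$: it first solves $E\rho(x/c) = \delta$ numerically (a Riemann sum plus bisection, our own choice of method), and then computes the M-scale by iterating $\sigma^2 \leftarrow \sigma^2\,\mathrm{ave}(\rho(x_i/(c\sigma)))/\delta$, which is the weighted RMS equation (2.55) rewritten with $W(x) = \rho(x)/x^2$. The data are simulated and the function names are hypothetical; the numerically computed constant is close to the value quoted above.

```python
import numpy as np

def rho_bisq(t):
    """Bisquare rho with k = 1, formula (2.52)."""
    u = np.minimum(np.abs(t), 1.0)
    return 1.0 - (1.0 - u**2) ** 3

def expected_rho(c, half_width=10.0, num=200001):
    """E rho(x / c) for x ~ N(0, 1), by a Riemann sum on a fine grid."""
    x = np.linspace(-half_width, half_width, num)
    pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(rho_bisq(x / c) * pdf) * (x[1] - x[0]))

def normalizing_constant(delta=0.5, lo=0.5, hi=5.0):
    """Solve E rho(x / c) = delta by bisection (E rho(x / c) decreases in c)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if expected_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def m_scale(x, c, delta=0.5, tol=1e-12, max_iter=500):
    """Normalized bisquare M-scale: solves ave(rho(x_i / (c s))) = delta for s."""
    x = np.asarray(x, dtype=float)
    s = np.median(np.abs(x)) / 0.675           # starting value
    if s == 0:
        return 0.0                             # rough guard for mostly-zero data
    for _ in range(max_iter):
        s_new = s * np.sqrt(np.mean(rho_bisq(x / (c * s))) / delta)
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s

c = normalizing_constant()
rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=2000)                # simulated N(0, 9) data
print(c, m_scale(x, c))                        # scale estimate should be near 3
```

Because of the normalization by $c$, the estimate is comparable to the standard deviation for normal data.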

Although scale M-estimators play an auxiliary role here, their importance will be 
seen in Chapters 5 and 6. 


2.6 Dispersion estimators 


The traditional way to measure the variability of a dataset $x$ is with the standard
deviation (SD)

$$\mathrm{SD}(x) = \left[\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2\right]^{1/2}.$$

For any constant $c$ the SD satisfies the shift invariance and scale equivariance conditions

$$\mathrm{SD}(x + c) = \mathrm{SD}(x), \quad \mathrm{SD}(cx) = |c|\,\mathrm{SD}(x). \tag{2.58}$$

Any statistic satisfying (2.58) will be called a dispersion (or scatter) estimator.

In Example 1.1 we observed the lack of robustness of the standard deviation, and
we now consider possible robust alternatives. One alternative estimator proposed in
the past is the mean absolute deviation (MD):

$$\mathrm{MD}(x) = \frac{1}{n}\sum_{i=1}^n |x_i-\bar{x}| \tag{2.59}$$

which is also sensitive to outliers, although less so than the SD (Tukey, 1960). In the
flour example, the MDs with and without the largest observation are, respectively,
2.14 and 0.52: still a large difference.

Both the SD and MD are defined by first centering the data by subtracting $\bar{x}$ (which
ensures shift invariance) and then taking a measure of “largeness” of the absolute
values. A robust alternative is to subtract the median instead of the mean, and then
take the median of the absolute values, which yields the MAD estimator introduced
in the previous chapter:


MAD(x) = Med(|x — Med(x)|) (2.60) 


which clearly satisfies (2.58). For the flour data with and without the largest obser- 
vation, the MADs are 0.35 and 0.34, respectively. 

In the same way as (2.59) and (2.60), we define the mean and the median absolute
deviations of a random variable $x$ as

$$\mathrm{MD}(x) = E|x - Ex| \tag{2.61}$$

and

$$\mathrm{MAD}(x) = \mathrm{Med}(|x - \mathrm{Med}(x)|), \tag{2.62}$$

respectively.

Two other well-known dispersion estimators are the range, defined as
$\max(x) - \min(x) = x_{(n)} - x_{(1)}$, and the sample interquartile range

$$\mathrm{IQR}(x) = x_{(n-m+1)} - x_{(m)}$$

where $m = [n/4]$. Both are based on order statistics; the former is clearly very sensitive
to outliers, while the latter is not.

Note that if $x \sim N(\mu, \sigma^2)$ (where “$\sim$” stands for “is distributed as”) then $\mathrm{SD}(x) = \sigma$
by definition, while MD(x), MAD(x) and IQR(x) are constant multiples of $\sigma$:

$$\mathrm{MD}(x) = c_1\sigma, \quad \mathrm{MAD}(x) = c_2\sigma, \quad \mathrm{IQR}(x) = 2c_2\sigma,$$

where

$$c_1 = 2\varphi(0) \quad\text{and}\quad c_2 = \Phi^{-1}(0.75)$$

(Problem 2.11). Hence if we want a dispersion estimator that “measures the same
thing” as the SD for the normal, we should normalize the MAD by dividing it by
$c_2 \approx 0.675$. The “normalized MAD” (MADN) is thus

$$\mathrm{MADN}(x) = \frac{\mathrm{MAD}(x)}{0.675}. \tag{2.63}$$

Likewise, we should normalize the MD and the IQR by dividing them by $c_1$ and
by $2c_2$ respectively.

Observe that for the flour data (which was found to be approximately normal)
MADN = 0.53, which is not far from the standard deviation of the data without the
outlier: 0.69.
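
These estimators are simple to compute. The Python sketch below is an illustration on made-up data (not the flour data of Example 1.1); it uses the exact constant $c_2 = \Phi^{-1}(0.75)$ from the standard library rather than the rounded value 0.675, and the function names are ours.

```python
import numpy as np
from statistics import NormalDist

C1 = 2.0 * NormalDist().pdf(0.0)       # MD / sigma at the normal, about 0.798
C2 = NormalDist().inv_cdf(0.75)        # MAD / sigma at the normal, about 0.6745

def md(x):
    """Mean absolute deviation about the mean, (2.59)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(x - x.mean())))

def mad(x):
    """Median absolute deviation about the median, (2.60)."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def madn(x):
    """Normalized MAD, (2.63), using the unrounded constant."""
    return mad(x) / C2

x = np.array([2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.0, 2.7, 2.4, 9.0])  # made-up data
for data in (x, x[:-1]):                # with and without the outlier 9.0
    print(np.std(data, ddof=1), md(data) / C1, madn(data))
```

In this example the SD and the normalized MD change markedly when the outlier is removed, whereas the MADN does not.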

The first step in computing the SD, MD and MAD is “centering” the data; that
is, subtracting a location estimator from the data values. The first two are not robust,
and the third has a low efficiency. An estimator that combines robustness and efficiency
is the following: first compute a location M-estimator $\hat\mu$, and then apply a scale
M-estimator $\hat\sigma$ to the centered data $x_i - \hat\mu$. Here $\hat\sigma$ should be normalized as described
at the end of Section 2.5. We shall call this $\hat\sigma$ an M-dispersion estimator.

Note that the IQR does not use centering. A dispersion estimator that does not
require centering and is more robust than the IQR (in a sense to be defined in the
next chapter) was proposed by Croux and Rousseeuw (1992) and Rousseeuw and
Croux (1993). The estimator, which they call $Q_n$, is based only on the differences
between data values. Let $m = \binom{n}{2}$. Call $d_{(1)} \le \ldots \le d_{(m)}$ the ordered values of the $m$
differences $d_{ij} = x_{(i)} - x_{(j)}$ with $i > j$. Then the estimator is defined as

$$Q_n = d_{(k)}, \quad k = \binom{[n/2]+1}{2}, \tag{2.64}$$

where $[\cdot]$ denotes the integer part. Since $k \approx m/4$, $Q_n$ is approximately the first quartile
of the $d_{ij}$s. It is easy to verify that, for any $k$, $Q_n$ is shift invariant and scale
equivariant. It can be shown that, for the normal, $Q_n$ has an efficiency of 0.82, and
the estimator $2.222\,Q_n$ is consistent for the SD.
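
A direct, if inefficient, implementation of (2.64) takes only a few lines. The Python sketch below forms all $n(n-1)/2$ pairwise differences, so it needs $O(n^2)$ memory and is meant only for small samples; faster algorithms exist but are not shown here. The data are simulated and the function name is ours.

```python
import numpy as np

def qn(x, normalize=True):
    """Naive O(n^2) version of the Q_n dispersion estimator (2.64)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    i, j = np.triu_indices(n, k=1)           # all pairs with i < j
    d = np.sort(np.abs(x[i] - x[j]))         # the n(n-1)/2 ordered differences
    h = n // 2 + 1
    k = h * (h - 1) // 2                     # k = binom([n/2] + 1, 2)
    q = d[k - 1]                             # k-th smallest difference
    return 2.222 * q if normalize else q

rng = np.random.default_rng(1)
x = rng.normal(size=400)                     # simulated standard normal data
print(qn(x), qn(x + 7.0), qn(-3.0 * x))      # shift invariance, scale equivariance
```

With the factor 2.222 the estimate is comparable to the standard deviation for normal data.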

Martin and Zamar (1993b) studied another dispersion estimator that does not 
require centering and has interesting robustness properties (Problem 2.16b). 


2.7 M-estimators of location with unknown dispersion 


Estimators defined by (2.13) are not scale equivariant. For example, if all the $x_i$ in
(2.13) are divided by 10, it does not follow that the respective solution $\hat\mu$ is divided
by 10. Sections 2.7.1 and 2.7.2 deal with approaches to define scale equivariant estimators.
To make this clear, assume we want to estimate $\mu$ in model (2.1) where $F$ is
given by the mixture (2.6) with $G = N(\mu, \sigma^2)$. If $\sigma$ were known, it would be natural to
divide (2.1) by $\sigma$ to reduce the problem to the case $\sigma = 1$, which implies estimating
$\mu$ by

$$\hat\mu = \arg\min_\mu \sum_{i=1}^n \rho\left(\frac{x_i-\mu}{\sigma}\right).$$

It is easy to verify that, as in (2.24), for large $n$ the approximate distribution of $\hat\mu$ is
$N(\mu, v/n)$, where

$$v = \sigma^2\frac{E\psi((x-\mu)/\sigma)^2}{(E\psi'((x-\mu)/\sigma))^2}. \tag{2.65}$$


2.7.1 Previous estimation of dispersion


To obtain scale equivariant M-estimators of location, an intuitive approach is to use

$$\hat\mu = \arg\min_\mu \sum_{i=1}^n \rho\left(\frac{x_i-\mu}{\hat\sigma}\right), \tag{2.66}$$

where $\hat\sigma$ is a previously computed dispersion estimator. It is easy to verify that $\hat\mu$ is
indeed scale equivariant. Since $\hat\sigma$ does not depend on $\mu$, (2.66) implies that $\hat\mu$ is a
solution of

$$\sum_{i=1}^n \psi\left(\frac{x_i-\hat\mu}{\hat\sigma}\right) = 0. \tag{2.67}$$

It is intuitive that G must itself be robust. In Example 1.2, using (2.66) with 
bisquare y with k = 4.68, and ¢ = MADN(x), yields ff = 25.56; using ¢ = SD(x) 
instead gives ff = 25.12. Now add to the dataset three copies of the lowest value, 
—44. The results change to 26.42 and 17.19. The reason for this change is that the 
outliers “inflate” the SD, and hence the location estimator attributes to them too much 
weight. 

Note that since k is chosen in order to ensure a given efficiency for the unit normal, 
if we want j/ to attain the same efficiency for any normal, 6 must “estimate the SD 
at the normal’, in the sense that if the data are N(n, o”), then when n > oo, 6 tends 
in probability to o. This is why we use the normalized median absolute deviation 
MADN described previously, rather than the un-normalized version MAD. 

If a number $m > n/2$ of data values are concentrated at a single value $x_0$, we
have $\text{MAD}(x) = 0$, and hence the estimator is not defined. In this case we define
$\hat{\mu} = x_0 = \text{Med}(x)$. Besides being intuitively plausible, this definition can be justified
by a limit argument. Let the $n$ data values be different, and let $m$ of them tend to $x_0$.
Then it is not difficult to show that, in the limit, the solution of (2.66) is $x_0$.

It can be proved that if $F$ is symmetric, then when $n$ is large, $\hat{\mu}$ behaves as if $\hat{\sigma}$ were
constant, in the following sense: if $\hat{\sigma}$ tends in probability to $\sigma$, then the distribution
of $\hat{\mu}$ is approximately normal with variance (2.65) (for asymmetric $F$ the asymptotic
variance is more complicated; see Section 10.6). Therefore the efficiency of $\hat{\mu}$ does
not depend on that of $\hat{\sigma}$. In Chapter 3 it will be seen, however, that its robustness does
depend on that of $\hat{\sigma}$.


2.7.2 Simultaneous M-estimators of location and dispersion 


An alternative approach is to consider a location–dispersion model with two
unknown parameters

$$x_i = \mu + \sigma u_i \tag{2.68}$$

where $u_i$ has density $f_0$, and hence $x_i$ has density

$$f(x) = \frac{1}{\sigma} f_0\left(\frac{x - \mu}{\sigma}\right). \tag{2.69}$$

In this case, $\sigma$ is the scale parameter of the random variables $\sigma u_i$, but it is a dispersion
parameter for the $x_i$.
We now derive the simultaneous MLE of $\mu$ and $\sigma$ in model (2.69):

$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu, \sigma} \frac{1}{\sigma^n} \prod_{i=1}^{n} f_0\left(\frac{x_i - \mu}{\sigma}\right),$$

which can be written as

$$(\hat{\mu}, \hat{\sigma}) = \arg\min_{\mu, \sigma} \left\{ \frac{1}{n} \sum_{i=1}^{n} \rho_0\left(\frac{x_i - \mu}{\sigma}\right) + \log \sigma \right\} \tag{2.70}$$

with $\rho_0 = -\log f_0$. The main point of interest here is $\mu$, while $\sigma$ is a “nuisance
parameter”.

Proceeding as in the derivations of (2.19) and (2.49) it follows that the MLEs
satisfy the system of equations

$$\sum_{i=1}^{n} \psi\left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) = 0 \tag{2.71}$$

$$\frac{1}{n} \sum_{i=1}^{n} \rho_{\text{scale}}\left(\frac{x_i - \hat{\mu}}{\hat{\sigma}}\right) = \delta \tag{2.72}$$

where

$$\psi(x) = \rho_0'(x), \quad \rho_{\text{scale}}(x) = x\psi(x), \quad \delta = 1. \tag{2.73}$$

The reason for the notation “$\rho_{\text{scale}}$” is that in all instances considered in this book,
$\rho_{\text{scale}}$ is a $\rho$-function in the sense of Definition 2.1; this characteristic is exploited in
Section 5.4.1. The notation will be used whenever it is necessary to distinguish this
$\rho_{\text{scale}}$, used for scale, from the $\rho$ in (2.14), used for location; otherwise, we shall write
just $\rho$.

We shall deal in general with simultaneous estimators $(\hat{\mu}, \hat{\sigma})$ defined as solutions
of systems of equations of the form (2.71)–(2.72), which need not correspond to the
MLE for any distribution. It can be proved (see Section 10.5) that for large $n$ the
distributions of $\hat{\mu}$ and $\hat{\sigma}$ are approximately normal. If $F$ is symmetric then $D(\hat{\mu}) \approx
N(\mu, v/n)$, with $v$ given by (2.65), where $\mu$ and $\sigma$ are the solutions of the system

$$E\,\psi\left(\frac{x - \mu}{\sigma}\right) = 0 \tag{2.74}$$

$$E\,\rho_{\text{scale}}\left(\frac{x - \mu}{\sigma}\right) = \delta. \tag{2.75}$$


We may choose Huber’s or the bisquare function for $\psi$. A very robust choice for
$\rho_{\text{scale}}$ is (2.53), with $c = 0.675$ to make it consistent with the SD for the normal, which
yields

$$\hat{\sigma} = \frac{1}{0.675} \text{Med}(|x - \hat{\mu}|). \tag{2.76}$$

Although this looks similar to using the previously computed MADN, it will be seen
in Chapter 6 that the latter yields more robust results.

In general, estimation with a previously computed dispersion is more robust than 
simultaneous estimation. However, simultaneous estimation will be useful in more 
general situations, as will be seen in Chapter 6. 


2.8 Numerical computing of M-estimators 


There are several methods available for computing M-estimators of location and/or
scale. In principle one could use any of the general methods for equation solving
such as the Newton–Raphson algorithm, but methods based on derivatives may be
unsafe with the types of $\rho$- and $\psi$-functions that yield good robustness properties (see
Chapter 9). Here we shall describe a computational method called iterative reweighting,
which takes special advantage of the characteristics of the problem.


2.8.1 Location with previously-computed dispersion estimation 


For the solution of the robust location estimation optimization problem (2.66), the
weighted average expression (2.32) suggests an iterative procedure. Start with a
robust dispersion estimator $\hat{\sigma}$ (for instance, the MADN) and some initial estimator
$\hat{\mu}_0$ (for instance, the sample median). Given $\hat{\mu}_k$ compute

$$w_{k,i} = W\left(\frac{x_i - \hat{\mu}_k}{\hat{\sigma}}\right) \quad (i = 1, \ldots, n) \tag{2.77}$$

where $W$ is the function in (2.31) and let

$$\hat{\mu}_{k+1} = \frac{\sum_{i=1}^{n} w_{k,i} x_i}{\sum_{i=1}^{n} w_{k,i}}. \tag{2.78}$$

Results to be proved in Section 9.1 imply that if $W(x)$ is bounded and nonincreasing
for $x > 0$, then the sequence $\hat{\mu}_k$ converges to a solution of (2.66). The algorithm,
which requires a stopping rule based on a tolerance parameter $\varepsilon$, is thus:


1. Compute $\hat{\sigma} = \text{MADN}(x)$ and $\hat{\mu}_0 = \text{Med}(x)$.
2. For $k = 0, 1, 2, \ldots$, compute the weights (2.77) and then $\hat{\mu}_{k+1}$ in (2.78).
3. Stop when $|\hat{\mu}_{k+1} - \hat{\mu}_k| < \varepsilon \hat{\sigma}$.
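
The three steps above can be sketched as follows. This is a minimal illustration, not the RobStatTM implementation: it assumes the bisquare weight function $W(r) = (1 - (r/k)^2)^2$ for $|r| \le k$ and 0 otherwise, with $k = 4.68$, and the function names, the iteration cap and the toy data are hypothetical.

```python
import statistics
from statistics import NormalDist


def madn(x):
    """Normalized MAD: median absolute deviation divided by about 0.675."""
    med = statistics.median(x)
    return statistics.median([abs(xi - med) for xi in x]) / NormalDist().inv_cdf(0.75)


def bisquare_weight(r, k=4.68):
    """Bisquare weight W(r) = psi(r)/r: (1 - (r/k)^2)^2 inside (-k, k), else 0."""
    if abs(r) >= k:
        return 0.0
    return (1.0 - (r / k) ** 2) ** 2


def m_location(x, k=4.68, eps=1e-9, max_iter=200):
    """Location M-estimate by iterative reweighting with a fixed dispersion."""
    sigma = madn(x)                       # step 1: dispersion and starting point
    mu = statistics.median(x)
    if sigma == 0.0:                      # more than half the data tied: use the median
        return mu
    for _ in range(max_iter):             # step 2: weights (2.77), update (2.78)
        w = [bisquare_weight((xi - mu) / sigma, k) for xi in x]
        mu_new = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        if abs(mu_new - mu) < eps * sigma:    # step 3: stopping rule
            return mu_new
        mu = mu_new
    return mu


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1000.0]   # hypothetical toy sample
print(m_location(data))
```

In this toy sample the value 1000 lies far beyond $k\hat{\sigma}$ from the bulk of the data, so it receives weight zero and the estimate settles at the center of the remaining seven points.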


If $\psi$ is increasing the solution is unique, and the starting point $\hat{\mu}_0$ influences only
the number of iterations. If $\psi$ is redescending then $\hat{\mu}_0$ must be robust in order to ensure
convergence to a “good” solution. Choosing $\hat{\mu}_0 = \text{Med}(x)$ suffices for this purpose.


Figure 2.6 Averages of $\psi(x - \mu)$ and $\rho(x - \mu)$ as a function of $\mu$


Figure 2.6 shows the averages of $\psi((x - \mu)/\hat{\sigma})$ and of $\rho((x - \mu)/\hat{\sigma})$ as a function
of $\mu$, where $\psi$ and $\rho$ correspond to the bisquare estimator with efficiency 0.95, and
$\hat{\sigma} = \text{MADN}$, for the data of Example 1.2, to which three extra values of the outlier
−44 were added. Three roots of the estimating equation (2.67) are apparent; one
corresponds to the absolute minimum of (2.66) while the other two correspond to
a relative minimum and a relative maximum. This effect occurs also with the original
data, but is less visible.


2.8.2 Scale estimators 


For solving (2.49), the expression (2.55) suggests an iterative procedure. Start with
some $\hat{\sigma}_0$, for instance, the normalized MAD (MADN). Given $\hat{\sigma}_k$ compute

$$w_{k,i} = W\left(\frac{x_i}{\hat{\sigma}_k}\right) \quad (i = 1, \ldots, n) \tag{2.79}$$

where $W$ is the weight function in (2.54) and let

$$\hat{\sigma}_{k+1} = \sqrt{\frac{1}{n\delta} \sum_{i=1}^{n} w_{k,i} x_i^2}. \tag{2.80}$$

Then if $W(x)$ is bounded, even, continuous and nonincreasing for $x > 0$, the sequence
$\hat{\sigma}_k$ converges to a solution of (2.55) and hence of (2.49) (for a proof see Section 9.4).
The algorithm is thus:


1. For $k = 0, 1, 2, \ldots$, compute the weights (2.79) and then $\hat{\sigma}_{k+1}$ in (2.80).
2. Stop when $|\hat{\sigma}_{k+1}/\hat{\sigma}_k - 1| < \varepsilon$.
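
The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the scaleM implementation: it uses a bisquare $\rho$ scaled to have maximum 1, with tuning constant 1.56 and $\delta = 0.5$, takes $W(t) = \rho(t)/t^2$ with its limit $3/c^2$ at $t = 0$, and starts from the normalized median of the $|x_i|$. The sample is formed from normal quantiles and is hypothetical.

```python
import math
import statistics
from statistics import NormalDist

C_BISQ = 1.56     # bisquare tuning constant (assumed, for consistency at the normal)
DELTA = 0.5       # right-hand side of the M-scale equation (assumed)


def rho_bisquare(t, c=C_BISQ):
    """Bounded bisquare rho, equal to 1 for |t| >= c."""
    if abs(t) >= c:
        return 1.0
    return 1.0 - (1.0 - (t / c) ** 2) ** 3


def scale_weight(t, c=C_BISQ):
    """W(t) = rho(t)/t^2, extended by its limit 3/c^2 at t = 0."""
    if t == 0.0:
        return 3.0 / c ** 2
    return rho_bisquare(t, c) / t ** 2


def m_scale(x, delta=DELTA, eps=1e-10, max_iter=1000):
    """M-scale of x (about zero) by iterative reweighting."""
    n = len(x)
    sigma = statistics.median([abs(xi) for xi in x]) / NormalDist().inv_cdf(0.75)
    for _ in range(max_iter):
        w = [scale_weight(xi / sigma) for xi in x]                       # (2.79)
        sigma_new = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)) / (n * delta))  # (2.80)
        if abs(sigma_new / sigma - 1.0) < eps:                           # stopping rule
            return sigma_new
        sigma = sigma_new
    return sigma


n = 200
clean = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
contaminated = clean[:-10] + [50.0] * 10    # replace the 10 largest values by outliers

s_clean = m_scale(clean)
s_cont = m_scale(contaminated)
print(round(s_clean, 3), round(s_cont, 3))
```

At convergence the average of $\rho(x_i/\hat{\sigma})$ equals $\delta$ up to the tolerance, which gives a direct check that the fixed point solves the M-scale equation.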


2.8.3 Simultaneous estimation of location and dispersion 


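The procedure for solving the system (2.71)–(2.72) is a combination of the ones
described in Sections 2.8.1 and 2.8.2. Compute starting values $\hat{\mu}_0, \hat{\sigma}_0$, and, given
$\hat{\mu}_k, \hat{\sigma}_k$, compute for $i = 1, \ldots, n$

$$r_{ki} = \frac{x_i - \hat{\mu}_k}{\hat{\sigma}_k}$$

and

$$w_{1ki} = W_1(r_{ki}), \quad w_{2ki} = W_2(r_{ki}),$$

where $W_1$ is the weight function $W$ in (2.31) and $W_2$ is the $W$ in (2.54) corresponding
to $\rho_{\text{scale}}$. Then at the $k$th iteration

$$\hat{\mu}_{k+1} = \frac{\sum_{i=1}^{n} w_{1ki} x_i}{\sum_{i=1}^{n} w_{1ki}}, \qquad \hat{\sigma}_{k+1}^2 = \frac{\hat{\sigma}_k^2}{n\delta} \sum_{i=1}^{n} w_{2ki} r_{ki}^2.$$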


2.9 Robust confidence intervals and tests 


2.9.1 Confidence intervals 


Since outliers affect both the sample mean $\bar{x}$ and the sample standard deviation $s$,
confidence intervals for $\mu = E(x)$ based on normal theory may be unreliable. Outliers may
displace $\bar{x}$ and/or “inflate” $s$, resulting in one or both of the following degradations in
performance:


• the true coverage probability may be much lower than the nominal one;
• the coverage probability may be either close to or higher than the nominal one, but
  at the cost of a loss of precision, in the form of an inflated expected confidence
  interval length.


We briefly elaborate on these points. 
Recall that the usual Student confidence interval, justified by the assumption of a
normal distribution for i.i.d. observations, is based on the “$t$-statistic”,

$$T = \frac{\bar{x} - \mu}{s/\sqrt{n}}. \tag{2.81}$$

From this, one gets the usual two-sided confidence intervals for $\mu$ with level $1 - \alpha$:

$$\bar{x} \pm t_{n-1, 1-\alpha/2} \frac{s}{\sqrt{n}},$$

where $t_{m,\beta}$ is the $\beta$-quantile of the $t$-distribution with $m$ degrees of freedom.

The simplest situation is when the distribution of the data is symmetric about
$\mu = Ex$. Then $E\bar{x} = \mu$ and the confidence interval is centered. However, heavy tails
in the distribution will cause the value of $s$ to be inflated, and hence the interval
length will be inflated, possibly by a large amount. Thus, in the case of symmetric
heavy-tailed distributions, the price paid for maintaining the target confidence
interval error rate $\alpha$ will often be unacceptably long confidence intervals. If the data
have a mixture distribution $(1 - \varepsilon)N(\mu, \sigma^2) + \varepsilon H$, where $H$ is not symmetric about
$\mu$, then the distribution of the data is not symmetric about $\mu$ and $E\bar{x} \neq \mu$. Then the
$t$ confidence interval with purported confidence level $1 - \alpha$ will not be centered and
will not have the error rate $\alpha$, and will lack robustness of both level and length. If the
data distribution is both heavy tailed and asymmetric, then the $t$ confidence interval
can fail to have the target error rate and at the same time have unacceptably large
interval lengths. Thus the classic $t$ confidence interval lacks robustness of both error
rate (confidence level) and length, and we need confidence intervals with both types
of robustness.

Approximate confidence intervals for a parameter of interest can be obtained from
the asymptotic distribution of a parameter estimator. Robust confidence intervals that
are not much influenced by outliers can be obtained by imitating the form of the
classical Student $t$ confidence interval, but replacing the average and SD by robust
location and dispersion estimators. Consider the M-estimators $\hat{\mu}$ in Section 2.7, and
recall that if $D(x)$ is symmetric then for large $n$ the distribution of $\hat{\mu}$ is approximately
$N(\mu, v/n)$, with $v$ given by (2.65). Since $v$ is unknown, an estimator $\hat{v}$ may be obtained
by replacing the expectations in (2.65) by sample averages, and the parameters by
their estimators:

$$\hat{v} = \hat{\sigma}^2 \, \frac{\text{ave}\left[\psi((x - \hat{\mu})/\hat{\sigma})^2\right]}{\left(\text{ave}\left[\psi'((x - \hat{\mu})/\hat{\sigma})\right]\right)^2}. \tag{2.82}$$

A robust approximate $t$-statistic (“Studentized M-estimator”) is then defined as

$$T = \frac{\hat{\mu} - \mu}{\sqrt{\hat{v}/n}} \tag{2.83}$$

and its distribution is approximately normal $N(0, 1)$ for large $n$. Thus a robust approximate
interval can then be computed as

$$\hat{\mu} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{v}}{n}}, \tag{2.84}$$


where $z_\beta$ denotes the $\beta$-quantile of $N(0, 1)$.


Table 2.5 Confidence intervals for flour data

Estimator                        $\hat{\mu}$    $\sqrt{\hat{v}/n}$    Interval
Mean                             4.280    1.081    (2.161, 6.400)
Bisquare M                       3.144    0.130    (2.885, 3.404)
Trimmed mean $\bar{x}_{0.25}$    3.269    0.117    (3.039, 3.499)
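
The robust interval (2.84) can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses Huber’s $\psi$ with $k = 1.37$ rather than the bisquare, the MADN as previously computed dispersion, and a hypothetical toy sample; the function names are not those of RobStatTM.

```python
import math
import statistics
from statistics import NormalDist


def madn(x):
    """Normalized MAD about the median."""
    med = statistics.median(x)
    return statistics.median([abs(xi - med) for xi in x]) / NormalDist().inv_cdf(0.75)


def huber_psi(r, k=1.37):
    return max(-k, min(k, r))


def huber_psi_prime(r, k=1.37):
    return 1.0 if abs(r) <= k else 0.0


def huber_location(x, sigma, k=1.37, eps=1e-9, max_iter=500):
    """Huber M-estimate by iterative reweighting, W(r) = min(1, k/|r|)."""
    mu = statistics.median(x)
    for _ in range(max_iter):
        w = [1.0 if abs(xi - mu) <= k * sigma else k * sigma / abs(xi - mu) for xi in x]
        mu_new = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        if abs(mu_new - mu) < eps * sigma:
            return mu_new
        mu = mu_new
    return mu


def robust_interval(x, k=1.37, level=0.95):
    """Return (mu_hat, lower, upper) following (2.82) and (2.84)."""
    n = len(x)
    sigma = madn(x)
    mu = huber_location(x, sigma, k)
    r = [(xi - mu) / sigma for xi in x]
    num = sum(huber_psi(ri, k) ** 2 for ri in r) / n        # ave of psi^2
    den = sum(huber_psi_prime(ri, k) for ri in r) / n       # ave of psi'
    v_hat = sigma ** 2 * num / den ** 2                     # (2.82)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half = z * math.sqrt(v_hat / n)                         # (2.84)
    return mu, mu - half, mu + half


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1000.0]   # hypothetical toy sample
mu_hat, lower, upper = robust_interval(data)
print(round(mu_hat, 3), round(lower, 3), round(upper, 3))
```

With a sample this small the normal approximation is only indicative; the point of the sketch is the mechanics of (2.82) and (2.84), and the fact that the interval stays near the bulk of the data despite the outlier.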


A similar procedure can be used for the trimmed mean. Recall that the asymptotic
variance of the $\alpha$-trimmed mean for symmetric $F$ is as shown in (2.41). We can
estimate $v$ with

$$\hat{v} = \frac{1}{n - 2m} \left( \sum_{i=m+1}^{n-m} (x_{(i)} - \hat{\mu})^2 + m (x_{(m)} - \hat{\mu})^2 + m (x_{(n-m+1)} - \hat{\mu})^2 \right). \tag{2.85}$$

An approximate $t$-statistic is then defined as (2.83). Note again that the variance of
the trimmed mean is not a trimmed variance, but rather a “Winsorized” variance.
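
The estimator (2.85) and the resulting interval can be sketched as follows. This is a minimal illustration assuming $m = [n\alpha]$ observations are trimmed at each end; the function name and the toy sample are hypothetical.

```python
import math
from statistics import NormalDist


def trimmed_mean_interval(x, alpha, level=0.95):
    """Return (trimmed mean, v_hat from (2.85), (lower, upper))."""
    xs = sorted(x)
    n = len(xs)
    m = int(n * alpha)                       # assumed: m = [n * alpha]
    core = xs[m:n - m]                       # x_(m+1), ..., x_(n-m)
    mu = sum(core) / len(core)
    ss = sum((xi - mu) ** 2 for xi in core)
    if m > 0:
        ss += m * (xs[m - 1] - mu) ** 2      # m * (x_(m) - mu)^2
        ss += m * (xs[n - m] - mu) ** 2      # m * (x_(n-m+1) - mu)^2
    v_hat = ss / (n - 2 * m)                 # (2.85)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half = z * math.sqrt(v_hat / n)
    return mu, v_hat, (mu - half, mu + half)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]   # hypothetical toy sample
tm, v_hat, (lower, upper) = trimmed_mean_interval(data, alpha=0.2)
print(tm, round(v_hat, 4), round(lower, 3), round(upper, 3))
```

Here $m = 2$, so the trimmed mean averages the six central values, while the two Winsorized terms use $x_{(2)}$ and $x_{(9)}$, each counted $m$ times.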

Table 2.5 gives for the data of Example 1.1 the location estimators, their estimated 
asymptotic SDs and the respective confidence intervals with level 0.95. The results 
were obtained with script flour. 


2.9.2 Tests 


It appears that many applied statisticians have the impression that $t$-tests are
sufficiently “robust”, and that they should have no worries when using them. Again,
this impression no doubt comes from the fact (a consequence of the central limit
theorem) that it suffices for the data to have finite variance for the classical
$t$-statistic (2.81) to be approximately $N(0, 1)$ in large samples. See for example the
discussion to this effect in the introductory text by Box et al. (1978). This means
that in large samples the Type 1 error rate of a level $\alpha$ $t$-test is in fact $\alpha$ for testing a
null hypothesis about the value of $\mu$. However, this fact is misleading, as we now
demonstrate.

Recall that the $t$-test with level $\alpha$ for the null hypothesis $H_0 = \{\mu = \mu_0\}$ rejects $H_0$
when the $t$-interval with confidence level $1 - \alpha$ does not contain $\mu_0$. According to the
discussion in Section 2.9.1 on the behavior of the $t$-intervals under contamination, we
conclude that if the data are symmetric but heavy tailed, the intervals will be longer
than necessary, with the consequence that the actual Type 1 error rate may be much
smaller than $\alpha$, but the Type 2 error rate may be too large; that is, the test will have low
power. If the contaminated distribution is asymmetric and heavy tailed, both errors
may become unacceptably high.

Robust tests can be derived from a “robust $t$-statistic” (2.83) in the same way
as was done with confidence intervals. The tests of level $\alpha$ for the null hypothesis
$\mu = \mu_0$ against the two-sided alternative $\mu \neq \mu_0$ and the one-sided alternative $\mu > \mu_0$
have the rejection regions

$$|\hat{\mu} - \mu_0| > \sqrt{\hat{v}/n}\; z_{1-\alpha/2} \quad \text{and} \quad \hat{\mu} > \mu_0 + \sqrt{\hat{v}/n}\; z_{1-\alpha}, \tag{2.86}$$

respectively.

The robust $t$-like confidence intervals and tests are easy to apply. They have,
however, some drawbacks when the contamination is asymmetric, because of the bias of
the estimator. Procedures that ensure a given probability of coverage or Type 1 error
probability for a contaminated parametric model were given by Huber (1965, 1968),
Huber-Carol (1970), Rieder (1978, 1981) and Fraiman et al. (2001). Yohai and Zamar
(2004) developed tests and confidence intervals for the median that are “nonparametric”,
in the sense that their level is valid for arbitrary distributions. Further references
on robust tests will be given in Section 4.7.


2.10 Appendix: proofs and complements 
2.10.1 Mixtures 
Let the density $f$ be given by

$$f = (1 - \varepsilon) g + \varepsilon h. \tag{2.87}$$

This is called a mixture of $g$ and $h$. If the variable $x$ has density $f$, and $q$ is any function,
then

$$E q(x) = \int_{-\infty}^{\infty} q(x) f(x)\,dx = (1 - \varepsilon) \int_{-\infty}^{\infty} q(x) g(x)\,dx + \varepsilon \int_{-\infty}^{\infty} q(x) h(x)\,dx.$$

With this expression we can calculate $Ex$; the variance is obtained from

$$\text{Var}(x) = E(x^2) - (Ex)^2.$$

If $g = N(0, 1)$ and $h = N(a, b^2)$ then

$$Ex = \varepsilon a \quad \text{and} \quad Ex^2 = (1 - \varepsilon) + \varepsilon (a^2 + b^2),$$

and hence

$$\text{Var}(x) = (1 - \varepsilon)(1 + \varepsilon a^2) + \varepsilon b^2. \tag{2.88}$$

Evaluating the performance of robust estimators requires simulating distributions
of the form (2.87). This is easily accomplished: generate $u$ with uniform distribution
in $(0, 1)$; if $u > \varepsilon$, generate $x$ with distribution $g$, else generate $x$ with distribution $h$.
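
The simulation recipe and formula (2.88) can be checked against each other. The following is a minimal sketch, assuming $g = N(0, 1)$ and $h = N(a, b^2)$; the seed, sample size and the values of $\varepsilon$, $a$ and $b$ are illustrative choices.

```python
import random


def mixture_variance(eps, a, b):
    """Variance (2.88) of (1 - eps) N(0, 1) + eps N(a, b^2)."""
    return (1.0 - eps) * (1.0 + eps * a ** 2) + eps * b ** 2


def sample_mixture(n, eps, a, b, rng):
    """Draw n values: N(0, 1) when u > eps, N(a, b^2) otherwise."""
    out = []
    for _ in range(n):
        u = rng.random()
        if u > eps:
            out.append(rng.gauss(0.0, 1.0))
        else:
            out.append(rng.gauss(a, b))
    return out


rng = random.Random(1)
eps, a, b = 0.1, 3.0, 2.0
x = sample_mixture(100000, eps, a, b, rng)

mean_x = sum(x) / len(x)
var_x = sum((xi - mean_x) ** 2 for xi in x) / len(x)
print("mean:", round(mean_x, 3), "theory:", eps * a)
print("var :", round(var_x, 3), "theory:", mixture_variance(eps, a, b))
```

For these values the theoretical mean is $\varepsilon a = 0.3$ and the theoretical variance is $0.9 \times 1.9 + 0.4 = 2.11$; the simulated moments should agree up to sampling error.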




2.10.2 Asymptotic normality of M-estimators 


In this section we give a heuristic proof of (2.24). To this end we begin with an
intuitive proof of (2.23). Define the functions

$$\lambda(s) = E\,\psi(x - s), \quad \hat{\lambda}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \psi(x_i - s),$$

so that $\hat{\mu}$ and $\mu_0$ verify respectively

$$\hat{\lambda}_n(\hat{\mu}) = 0, \quad \lambda(\mu_0) = 0.$$

For each $s$, the random variables $\psi(x_i - s)$ are i.i.d. with mean $\lambda(s)$, and hence the
law of large numbers implies that when $n \to \infty$

$$\hat{\lambda}_n(s) \to_p \lambda(s) \quad \forall s.$$

It is intuitive that also the solution of $\hat{\lambda}_n(s) = 0$ should tend to that of $\lambda(s) = 0$.
This can in fact be proved rigorously (see Theorem 10.5).

Now we prove (2.24). Taking the Taylor expansion of order 1 of (2.19) as a function
of $\hat{\mu}$ about $\mu_0$ yields

$$0 = \sum_{i=1}^{n} \psi(x_i - \mu_0) - (\hat{\mu} - \mu_0) \sum_{i=1}^{n} \psi'(x_i - \mu_0) + o(\hat{\mu} - \mu_0) \tag{2.89}$$

where the last (“second-order”) term is such that

$$\lim_{t \to 0} \frac{o(t)}{t} = 0.$$

Dropping the last term in (2.89) yields

$$\sqrt{n}(\hat{\mu} - \mu_0) \approx \frac{A_n}{B_n} \tag{2.90}$$

with

$$A_n = \sqrt{n}\, \text{ave}(\psi(x - \mu_0)), \quad B_n = \text{ave}(\psi'(x - \mu_0)).$$

The random variables $\psi(x_i - \mu_0)$ are i.i.d. with mean 0 because of (2.22).
The central limit theorem implies that the distribution of $A_n$ tends to $N(0, a)$ with
$a = E\,\psi(x - \mu_0)^2$, and the law of large numbers implies that $B_n$ tends in probability
to $b = E\,\psi'(x - \mu_0)$. Hence by Slutsky’s lemma (see Section 2.10.3) $A_n/B_n$ can be
replaced for large $n$ by $A_n/b$, which tends in distribution to $N(0, a/b^2)$, as stated.
A rigorous proof will be given in Theorem 10.7.

Note that we have shown that $\sqrt{n}(\hat{\mu} - \mu_0)$ converges in distribution; this is
expressed by saying that “$\hat{\mu}$ has order $n^{-1/2}$ consistency”.


2.10.3 Slutsky’s lemma


Let $u_n$ and $v_n$ be two sequences of random variables such that $u_n$ tends in probability
to a constant $u$, and the distribution of $v_n$ tends to the distribution of a variable $v$
(abbreviated “$v_n \to_d v$”). Then

$$u_n + v_n \to_d u + v \quad \text{and} \quad u_n v_n \to_d uv.$$

The proof can be found in Bickel and Doksum (2001, p. 467) or Shao (2003,
p. 60).


2.10.4 Quantiles 


For $\alpha \in (0, 1)$ and $F$ a continuous and increasing distribution function, the $\alpha$-quantile
of $F$ is the unique value $q(\alpha)$ such that $F(q(\alpha)) = \alpha$. If $F$ is discontinuous, such a value
might not exist. For this reason we define $q(\alpha)$ in general as a value where $F(t) - \alpha$
changes sign; that is,

$$\text{sgn}\left\{ \lim_{t \uparrow q(\alpha)} (F(t) - \alpha) \right\} \neq \text{sgn}\left\{ \lim_{t \downarrow q(\alpha)} (F(t) - \alpha) \right\},$$

where “$\uparrow$” and “$\downarrow$” denote the limits from the left and from the right, respectively. It is
easy to show that such a value always exists. It is unique if $F$ is increasing. Otherwise,
it is not necessarily unique, and hence we may speak of an $\alpha$-quantile.
If $x$ is a random variable with distribution function $F(t) = P(x \leq t)$, $q(\alpha)$ will also
be considered as an $\alpha$-quantile of the variable $x$, and in this case is denoted by $x_\alpha$.
If $g$ is a monotonic function, and $y = g(x)$, then

$$g(x_\alpha) = \begin{cases} y_\alpha & \text{if } g \text{ is increasing} \\ y_{1-\alpha} & \text{if } g \text{ is decreasing,} \end{cases} \tag{2.91}$$

in the sense that, for example, if $z$ is an $\alpha$-quantile of $x$, then $z^3$ is an $\alpha$-quantile of $x^3$.

When the $\alpha$-quantile is not unique, there exists an interval $[a, b)$ such that $F(t) =
\alpha$ for $t \in [a, b)$. We may obtain uniqueness by defining $q(\alpha)$ as $a$ (the smallest
$\alpha$-quantile), and then (2.91) remains valid. It seems more symmetric to define it as
the midpoint $(a + b)/2$, but then (2.91) ceases to hold.


2.10.5 Alternative algorithms for M-estimators 
2.10.5.1 The Newton–Raphson procedure


The Newton–Raphson procedure is a widely used iterative method for the solution of
nonlinear equations. To solve the equation $h(t) = 0$, at each iteration, $h$ is “linearized”;
that is, replaced by its Taylor expansion of order 1 about the current approximation.
Thus, if at iteration $m$ we have the approximation $t_m$, then the next value $t_{m+1}$ is the
solution of

$$h(t_m) + h'(t_m)(t_{m+1} - t_m) = 0.$$

In other words,

$$t_{m+1} = t_m - \frac{h(t_m)}{h'(t_m)}. \tag{2.92}$$

If the procedure converges, the convergence is very fast; but it is not guaranteed to
converge. If $h'$ is not bounded away from zero, the denominator in (2.92) may become
very small, making the sequence $t_m$ unstable unless the initial value $t_0$ is very near to
the solution.

This happens in the case of a location M-estimator, where we must solve the
equation $h(\mu) = 0$ with $h(\mu) = \text{ave}\{\psi(x - \mu)\}$. Here the iterations are

$$\hat{\mu}_{m+1} = \hat{\mu}_m + \frac{\sum_{i=1}^{n} \psi(x_i - \hat{\mu}_m)}{\sum_{i=1}^{n} \psi'(x_i - \hat{\mu}_m)}. \tag{2.93}$$

If $\psi$ is bounded, its derivative $\psi'$ tends to zero at infinity, and hence the denominator
is not bounded away from zero, which makes the procedure unreliable. For
this reason, algorithms based on iterative reweighting are preferable, since these are
guaranteed to converge.

However, although the result of iterating the Newton–Raphson process indefinitely
may be unreliable, the result of a single iteration may be a robust and efficient
estimator, if the initial value $\hat{\mu}_0$ is robust but not necessarily efficient, like the median;
see Problem 3.16.
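
A single Newton–Raphson step started at the median can be sketched as follows. The example is an illustration under stated assumptions: it applies (2.93) to residuals standardized by the MADN, so the correction is multiplied by $\hat{\sigma}$, and it uses Huber’s $\psi$ with $k = 1.37$; the function name and the toy data are hypothetical.

```python
import statistics
from statistics import NormalDist


def madn(x):
    """Normalized MAD about the median."""
    med = statistics.median(x)
    return statistics.median([abs(xi - med) for xi in x]) / NormalDist().inv_cdf(0.75)


def one_step_newton(x, k=1.37):
    """One Newton-Raphson step for a Huber location M-estimator, from the median."""
    sigma = madn(x)
    mu0 = statistics.median(x)
    r = [(xi - mu0) / sigma for xi in x]
    num = sum(max(-k, min(k, ri)) for ri in r)        # sum of psi(r_i)
    den = sum(1.0 for ri in r if abs(ri) <= k)        # sum of psi'(r_i)
    if den == 0.0:
        # Cannot happen when starting at the median with the MADN, since at least
        # half of the residuals satisfy |r| <= 0.675 < k; kept as a safeguard.
        raise ZeroDivisionError("all residuals lie where psi' vanishes")
    return mu0 + sigma * num / den


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1000.0]   # hypothetical toy sample
mu1 = one_step_newton(data)
print(round(mu1, 4))
```

The safeguard on the denominator reflects the instability discussed above: away from a robust starting point, most residuals may fall where $\psi' = 0$ and the step becomes arbitrarily large.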


2.10.5.2 Iterative pseudo-observations 


The expression (2.34) of an M-estimator as a function of the pseudo-observations
(2.35) can be used as the basis for an iterative procedure to compute a location
estimator with previous dispersion $\hat{\sigma}$. Starting with an initial $\hat{\mu}_0$, define

$$\hat{\mu}_{m+1} = \frac{1}{n} \sum_{i=1}^{n} \zeta(x_i, \hat{\mu}_m, \hat{\sigma}), \tag{2.94}$$

where

$$\zeta(x, \mu, \sigma) = \mu + \sigma \psi\left(\frac{x - \mu}{\sigma}\right). \tag{2.95}$$

It can be shown that $\hat{\mu}_m$ converges under very general conditions to the solution of
(2.67) (Huber and Ronchetti, 2009). However, the convergence is much slower than
that corresponding to the reweighting procedure.
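
The iteration (2.94)–(2.95) can be sketched as follows. This is a minimal illustration assuming Huber’s $\psi$ with $k = 1.37$, the MADN as previous dispersion and the median as starting value; the function names, tolerance and toy data are hypothetical.

```python
import statistics
from statistics import NormalDist


def madn(x):
    """Normalized MAD about the median."""
    med = statistics.median(x)
    return statistics.median([abs(xi - med) for xi in x]) / NormalDist().inv_cdf(0.75)


def huber_psi(r, k=1.37):
    return max(-k, min(k, r))


def pseudo_obs_location(x, sigma, mu0, k=1.37, eps=1e-10, max_iter=10000):
    """Iterate (2.94): average the pseudo-observations (2.95) until stable."""
    n = len(x)
    mu = mu0
    for iteration in range(1, max_iter + 1):
        zeta = [mu + sigma * huber_psi((xi - mu) / sigma, k) for xi in x]   # (2.95)
        mu_new = sum(zeta) / n                                              # (2.94)
        if abs(mu_new - mu) < eps * sigma:
            return mu_new, iteration
        mu = mu_new
    return mu, max_iter


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1000.0]   # hypothetical toy sample
sigma_hat = madn(data)
mu_hat, n_iter = pseudo_obs_location(data, sigma_hat, statistics.median(data))
print(round(mu_hat, 4), n_iter)
```

At convergence the average of $\psi((x_i - \hat{\mu})/\hat{\sigma})$ is zero, so the limit solves (2.67). The returned iteration count gives a rough way to compare the speed with that of the reweighting procedure on a given sample.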


2.11 Recommendations and software 


For location we recommend the bisquare M-estimator with MAD scale, and the
confidence intervals defined in (2.84). The function locScaleM (library RobStatTM)
computes the bisquare and Huber estimators, their estimated standard deviations
needed for the intervals, and the M-dispersion estimator defined in Section 2.6.

The function scaleM (RobStatTM) computes the bisquare M-scale defined in
Section 2.5.


2.12 Problems 


2.1. Show that in a sample of size $n$ from a contaminated distribution (2.6), the
     number of observations from $H$ is random, with binomial distribution $\text{Bi}(n, \varepsilon)$.


2.2. For the data of Example 1.2, compute the mean and median, the 25% trimmed
     mean and the M-estimator with previous dispersion and Huber’s $\psi$ with
     $k = 1.37$. Use the latter to derive a 90% confidence interval for the true value.


2.3. Verify (2.11) using (2.27).
2.4. For what values of $\nu$ does the Student distribution have moments of order $k$?


2.5. Show that if $\mu$ is a solution of (2.19), then $\mu + c$ is a solution of (2.19) with
     $x_i + c$ instead of $x_i$.


2.6. Show that if $x = \mu_0 + u$ where the distribution of $u$ is symmetric about 0, then
     $\mu_0$ is a solution of (2.22).


2.7. Verify (2.30) [hint: use $\varphi'(x) = -x\varphi(x)$ and integration by parts]. From this,
     find the values of $k$ which yield variances $1/\alpha$ with $\alpha = 0.90$, 0.95 and 0.99 (by
     using an equation solver, or just trial and error).


2.8. Compute the $\alpha$-trimmed means with $\alpha = 0.10$ and 0.25 for the data of
     Example 1.2.


2.9. Show that if $\psi$ is odd, then the M-estimator $\hat{\mu}$ satisfies conditions C1-C2-C3
     at the end of Section 2.4.


2.10. Show using (2.46) that L-estimators are shift and scale equivariant [recall that
      the order statistics of $y_i = -x_i$ are $y_{(i)} = -x_{(n-i+1)}$!] and also fulfill C1-C2-C3
      of Section 2.4.


2.11. If $x \sim N(\mu, \sigma^2)$, calculate MD($x$), MAD($x$) and IQR($x$).


2.12. Show that if $\psi = \rho'$ vanishes identically outside an interval, there is no density
      verifying (2.14).


2.13. Define the sample $\alpha$-quantile of $x_1, \ldots, x_n$, with $\alpha \in (1/n, 1 - 1/n)$, as $x_{(k)}$,
      where $k$ is the smallest integer $\geq n\alpha$ and $x_{(i)}$ are the order statistics (1.2). Let

      $$\psi(x) = \alpha I(x > 0) - (1 - \alpha) I(x < 0).$$

      Show that $\mu = x_{(k)}$ is a solution (not necessarily unique) of (2.19). Use this fact
      to derive the asymptotic distribution of sample quantiles, assuming that $D(x_i)$
      has a unique $\alpha$-quantile. Note that this $\psi$ is not odd.


2.14. Show that the M-scale (2.49) with $\rho(t) = I(|t| > 1)$ is the $h$th order statistic of
      the $|x_i|$ with $h = n - [n\delta]$.


2.15. Verify (2.37), (2.51) and (2.56).


2.16. Let $[a, b]$, where $a$ and $b$ depend on the data, be the shortest interval containing
      at least half of the data.

      (a) The Shorth (“shortest half”) location estimator is defined as the midpoint
          $\hat{\mu} = (a + b)/2$. Show that $\hat{\mu} = \arg\min_{\mu} \text{Med}(|x - \mu|)$.

      (b) Show that the difference $b - a$ is a dispersion estimator.

      (c) For a distribution $F$, let $[a, b]$ be the shortest interval with probability 0.5.
          Find this interval for $F = N(\mu, \sigma^2)$.


2.17. Let $\hat{\mu}$ be a location M-estimator. Show that if the distribution of the $x_i$ is
      symmetric about $\mu$, so is the distribution of $\hat{\mu}$, and that the same happens with
      trimmed means.


2.18. Verify numerically that the constant $c$ at the end of Section 2.5 that makes the
      bisquare scale consistent for the normal is indeed equal to 1.56.


2.19. Show that

      (a) if the sequence $\hat{\mu}_m$ in (2.93) converges, then the limit is a solution of (2.19)
      (b) if the sequence in (2.94) converges, then the limit is a solution of (2.67).


3 


Measuring Robustness 


In order to measure the effect of different locations of an outlier on an estimate,
consider adding to a sample $\mathbf{x} = (x_1, \ldots, x_n)$ an extra data point $x_0$ that is allowed to
range on the whole real line. We define the sensitivity curve of the estimator $\hat{\mu}$ for the
sample $\mathbf{x}$ as the difference

$$\hat{\mu}(x_1, \ldots, x_n, x_0) - \hat{\mu}(x_1, \ldots, x_n) \tag{3.1}$$

as a function of the location $x_0$ of the outlier.

For purposes of plotting and comparing sensitivity curves across sample sizes,
it is convenient to use standardized sensitivity curves, obtained by multiplying (3.1)
by $(n + 1)$ (see also Section 3.1). To make our examples clearer, we use a “sample”
formed from standard normal distribution quantiles, instead of a random one.
Figure 3.1 plots the standardized sensitivity curves of:


• the median
• the 10% trimmed mean $\bar{x}_{0.10}$
• the 10% Winsorized mean
• the Huber M-estimator with $k = 1.37$, using both the SD and the MADN as previously
  computed dispersion estimators
• the bisquare M-estimator with $k = 4.68$ using the MADN as dispersion estimator.


Also included is an M-estimator with particular optimality properties, to be defined
in Section 5.8.1.
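
A standardized sensitivity curve is simple to compute directly from (3.1). The following is a minimal sketch for the mean and the median, assuming a “sample” of $n = 20$ standard normal quantiles at the plotting positions $(i - 0.5)/n$; the plotting positions and the function name are assumptions made for illustration.

```python
import statistics
from statistics import NormalDist


def standardized_sc(estimator, x, x0):
    """(n + 1) times the difference (3.1) for the added point x0."""
    n = len(x)
    return (n + 1) * (estimator(x + [x0]) - estimator(x))


n = 20
x = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for x0 in (-10.0, -1.0, 0.0, 1.0, 10.0, 1000.0):
    sc_mean = standardized_sc(statistics.fmean, x, x0)
    sc_median = standardized_sc(statistics.median, x, x0)
    print(x0, round(sc_mean, 3), round(sc_median, 3))
```

For the mean the standardized curve equals $x_0 - \bar{x}$, which is unbounded in $x_0$, whereas for the median it stays constant once $x_0$ moves beyond the central order statistics.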

We can see that all curves are bounded, except the one corresponding to the
Huber estimator with SD as dispersion estimator, which grows without bound
with $x_0$. The same unbounded behavior (not shown in the figure) occurs with the
bisquare estimator with the SD as dispersion estimator. This shows the importance
of a robust previous dispersion. All curves are nondecreasing for positive $x_0$, except
the ones for the bisquare and “optimal” M-estimators. Roughly speaking, we say that
the bisquare M-estimator rejects extreme values, while the others do not. The curve
for the trimmed mean shows that it does not reject large observations, but just limits
their influence. The curve for the median is very steep at the origin.


Figure 3.1 Sensitivity curves of location estimators (panels: median, 10% Winsorized
mean, 10% trimmed mean, Huber M-estimates, bisquare M-estimate, optimal M-estimate)


Robust Statistics: Theory and Methods (with R), Second Edition.
Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matias Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd.
Companion website: www.wiley.com/go/maronna/robust

Figure 3.2 shows the sensitivity curves of the SD along with the normalized MD, 
MAD and IQR. The SD and MD have unbounded sensitivity curves, while those of 
the normalized MAD and IQR are bounded. 

Imagine now that instead of adding a single point at a variable location, we replace
$m$ points by a fixed value $x_0 = 1000$. Table 3.1 shows the resulting “biases”

$$\hat{\mu}(x_0, x_0, \ldots, x_0, x_{m+1}, \ldots, x_n) - \hat{\mu}(x_1, \ldots, x_n)$$

as a function of $m$ for the following location estimators:


• the median
• the Huber estimator with $k = 1.37$ and three different dispersions: previously
  estimated MAD (denoted by MADp), simultaneous MAD (“MADs”) and
  previous SD
• the trimmed mean with $\alpha = 0.085$
• the bisquare estimator.


We also provide the biases for the normalized MAD and IQR dispersion estimators.
The choice of $k$ and $\alpha$ was made so that both the Huber estimators and the
trimmed mean have the same asymptotic variance for the normal distribution.


Figure 3.2 Sensitivity curves of dispersion estimators


Table 3.1 The effect of increasing contamination on a sample of size 20

 m    Mean   Median   H(MADp)   H(MADs)   H(SD)   $\bar{x}_\alpha$   M-Bisq    MAD     IQR
 1     50     0.00      0.03      0.04    16.06     0.04     −0.02     0.12    0.08
 2    100     0.01      0.10      0.11    46.78    55.59      0.04     0.22    0.14
 4    200     0.21      0.36      0.37    140.5    166.7      0.10     0.46   30.41
 5    250     0.34      0.62      0.95    202.9    222.3      0.15     0.56   370.3
 7    350     0.48      1.43     42.66    350.0    333.4      0.21     1.29   740.3
 9    450     0.76      3.23     450.0    450.0    444.5      0.40     2.16   740.2
10    500    500.5     500.0     500.0    500.0    500.0     500.0    739.3   740.2

The mean deteriorates immediately when $m = 1$, as expected, and since $[\alpha n] =
[0.085 \times 20] = 1$ the trimmed mean $\bar{x}_\alpha$ deteriorates when $m = 2$, as could be
expected. The H(MADs) deteriorates rapidly, starting at $m = 8$, while H(SD) is
already quite bad at $m = 1$. By contrast, the median, H(MADp) and M-Bisq do
so only when $m = n/2$, with M-Bisq having smaller bias than H(MADp), and the
median (Med) having small biases, comparable to those of the M-Bisq (only slightly
higher bias than M-Bisq at $m = 4, 5, 7, 9$).
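
The replacement experiment is easy to reproduce in outline for the mean and the median. The following is a minimal sketch, assuming a hypothetical sample of 20 standard normal quantiles sorted in increasing order, of which the first $m$ values are replaced by $x_0 = 1000$. Since the sample behind Table 3.1 is not reproduced here, the numbers are not expected to match the table; only the qualitative pattern is.

```python
import statistics
from statistics import NormalDist


def replacement_bias(estimator, x, m, x0=1000.0):
    """estimator(x with its first m values replaced by x0) minus estimator(x)."""
    contaminated = [x0] * m + x[m:]
    return estimator(contaminated) - estimator(x)


n = 20
x = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]   # hypothetical sample

for m in (1, 2, 4, 5, 7, 9, 10):
    bias_mean = replacement_bias(statistics.fmean, x, m)
    bias_median = replacement_bias(statistics.median, x, m)
    print(m, round(bias_mean, 2), round(bias_median, 2))
```

The mean bias grows by roughly $1000/n = 50$ per replaced point, while the median bias stays small until half of the sample has been replaced.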

To formalize these notions, it will be easier to study the behavior of estimators
when the sample size tends to infinity (“asymptotic behavior”). Consider an estimator
$\hat{\theta}_n = \hat{\theta}_n(\mathbf{x})$ depending on a sample $\mathbf{x} = \{x_1, \ldots, x_n\}$ of size $n$ of i.i.d. variables
with distribution $F$. In all cases of practical interest, there is a value depending on $F$,
$\hat{\theta}_\infty = \hat{\theta}_\infty(F)$, such that

$$\hat{\theta}_n \to_p \hat{\theta}_\infty(F).$$

$\hat{\theta}_\infty(F)$ is the asymptotic value of the estimator at $F$.

If $\hat{\theta}_n = \bar{x}$ (the sample mean) then $\hat{\theta}_\infty(F) = E_F x$ (the distribution mean), and
if $\hat{\theta}_n(\mathbf{x}) = \text{Med}(\mathbf{x})$ (the sample median) then $\hat{\theta}_\infty(F) = F^{-1}(0.5)$ (the distribution
median). If $\hat{\theta}_n$ is a location M-estimator given by (2.19) with $\psi$ monotonic, it was
stated in Section 2.10.2 that $\hat{\theta}_\infty(F)$ is the solution of

$$E_F\,\psi(x - \theta) = 0.$$

A proof is given in Theorem 10.5. The same reasoning shows that if $\hat{\theta}_n$ is a scale
M-estimator (2.49), then $\hat{\theta}_\infty(F)$ is the solution of

$$E_F\,\rho\left(\frac{x}{\theta}\right) = \delta.$$

It can also be shown that if $\hat{\theta}_n$ is a location M-estimator given by (2.13), then $\hat{\theta}_\infty(F)$
is the solution of

$$E_F\,\rho(x - \theta) = \min.$$

Details can be found in Huber and Ronchetti (2009; Sec. 6.2). Asymptotic values also
exist for the trimmed mean (Section 10.7).
The typical distribution of data depends on one or more unknown parameters.
Thus in the location model (2.2) the data have distribution function $F_\mu(x) = F_0(x - \mu)$,
and in the location–dispersion model (2.68) the distribution is $F_\theta(x) = F_0((x - \mu)/\sigma)$
with $\theta = (\mu, \sigma)$. These are called parametric models. In the location model we
have seen in (2.23) that if the data are symmetric about $\mu$ and $\hat{\mu}$ is an M-estimator,
then $\hat{\mu} \to_p \mu$ and so $\hat{\mu}_\infty(F_\mu) = \mu$. An estimator $\hat{\theta}$ of the parameter(s) of a parametric
family $F_\theta$ will be called consistent if

$$\hat{\theta}_\infty(F_\theta) = \theta. \tag{3.2}$$

Since we assume $F$ to be only approximately known, we are interested in the behavior
of $\hat{\theta}_\infty(F)$ when $F$ ranges over a “neighborhood” of a distribution $F_\theta$. There are several
ways to characterize neighborhoods. The easiest to deal with are contamination
neighborhoods:

$$\mathcal{F}(F, \varepsilon) = \{(1 - \varepsilon)F + \varepsilon G : G \in \mathcal{G}\} \tag{3.3}$$

where $\mathcal{G}$ is a suitable set of distributions, often the set of all distributions but in some
cases the set of point-mass distributions, where the “point mass” $\delta_{x_0}$ is the distribution
such that $P(x = x_0) = 1$.


3.1 The influence function


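The influence function (IF) of an estimator (Hampel, 1974) is an asymptotic version
of its sensitivity curve. It is an approximation to the behavior of $\hat{\theta}_\infty$ when the sample
contains a small fraction $\varepsilon$ of identical outliers. It is defined as

$$\text{IF}_{\hat{\theta}}(x_0, F) = \lim_{\varepsilon \downarrow 0} \frac{\hat{\theta}_\infty((1 - \varepsilon)F + \varepsilon\delta_{x_0}) - \hat{\theta}_\infty(F)}{\varepsilon} \tag{3.4}$$

$$= \left. \frac{\partial}{\partial \varepsilon}\, \hat{\theta}_\infty((1 - \varepsilon)F + \varepsilon\delta_{x_0}) \right|_{\varepsilon \downarrow 0}, \tag{3.5}$$

where $\delta_{x_0}$ is the point mass at $x_0$ and “$\downarrow$” stands for “limit from the right”. If there
are $p$ unknown parameters, then $\hat{\theta}_\infty$ is a $p$-dimensional vector and so is its IF.
Henceforth, the argument of $\hat{\theta}_\infty(F)$ will be dropped if there is no ambiguity.

The quantity $\hat{\theta}_\infty((1 - \varepsilon)F + \varepsilon\delta_{x_0})$ is the asymptotic value of the estimator
when the underlying distribution is $F$ and a fraction $\varepsilon$ of outliers is equal to $x_0$. Thus
if $\varepsilon$ is small, this value can be approximated by

$$\hat{\theta}_\infty((1 - \varepsilon)F + \varepsilon\delta_{x_0}) \approx \hat{\theta}_\infty(F) + \varepsilon\, \text{IF}_{\hat{\theta}}(x_0, F)$$

and the bias $\hat{\theta}_\infty((1 - \varepsilon)F + \varepsilon\delta_{x_0}) - \hat{\theta}_\infty(F)$ is approximated by $\varepsilon\, \text{IF}_{\hat{\theta}}(x_0, F)$.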
The IF may be considered as a “limit version” of the sensitivity curve, in the
following sense. When we add the new observation $x_0$ to the sample $x_1, \ldots, x_n$ the
fraction of contamination is $1/(n + 1)$, and so we define the standardized sensitivity
curve (SC) as

$$\text{SC}_n(x_0) = \frac{\hat{\theta}_{n+1}(x_1, \ldots, x_n, x_0) - \hat{\theta}_n(x_1, \ldots, x_n)}{1/(n + 1)}$$

$$= (n + 1) \left( \hat{\theta}_{n+1}(x_1, \ldots, x_n, x_0) - \hat{\theta}_n(x_1, \ldots, x_n) \right),$$

which is similar to (3.4) with $\varepsilon = 1/(n + 1)$. One would expect that if the $x_i$ are i.i.d.
with distribution $F$, then $\text{SC}_n(x_0) \approx \text{IF}(x_0, F)$ for large $n$. This notion can be made
precise. Note that for each $x_0$, $\text{SC}_n(x_0)$ is a random variable. Croux (1998) has shown
that if $\hat{\theta}$ is a location M-estimator with a bounded and continuous $\psi$-function, or is a
trimmed mean, then for each $x_0$

$$\text{SC}_n(x_0) \to_{\text{a.s.}} \text{IF}_{\hat{\theta}}(x_0, F), \tag{3.6}$$

where “a.s.” denotes convergence with probability 1 (“almost sure” convergence).
This result is extended to general M-estimators in Section 10.4. See, however, the
remarks in Section 3.1.1.

It will be shown in Section 3.8.1 that for a location M-estimator $\hat\mu$

$$\mathrm{IF}_{\hat\mu}(x_0, F) = \frac{\psi(x_0 - \hat\mu_\infty)}{E_F\,\psi'(x - \hat\mu_\infty)} \tag{3.7}$$

and for a scale M-estimator $\hat\sigma$ (Section 2.5)

$$\mathrm{IF}_{\hat\sigma}(x_0, F) = \hat\sigma_\infty\,\frac{\rho(x_0/\hat\sigma_\infty) - \delta}{E_F\,(x/\hat\sigma_\infty)\,\rho'(x/\hat\sigma_\infty)}. \tag{3.8}$$

For the median estimator, the denominator is to be interpreted as in (2.27). The
similarity between the IF and the SC of a given estimator can be seen by comparing
Figure 3.1 to Figures 2.3 and 2.5. The same thing happens with Figure 3.2.

We see above that the IF of an M-estimator is proportional to its $\psi$-function (or an
offset $\rho$-function in the case of the scale estimator), and this behavior holds in general
for M-estimators. Given a parametric model $F_\theta$, a general M-estimator $\hat\theta$ is defined
as a solution of

$$\sum_{i=1}^{n} \Psi(x_i, \hat\theta) = 0. \tag{3.9}$$

For location, $\Psi(x, \theta) = \psi(x - \theta)$, and for scale, $\Psi(x, \theta) = \rho(x/\theta) - \delta$. It is shown in
Section 10.2 that the asymptotic value $\hat\theta_\infty$ of the estimator at $F$ satisfies

$$E_F\,\Psi(x, \hat\theta_\infty) = 0. \tag{3.10}$$


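It is shown in Section 3.8.1 that the IF of a general M-estimator is

$$\mathrm{IF}_{\hat\theta}(x_0, F) = -\frac{\Psi(x_0, \hat\theta_\infty)}{B(\hat\theta_\infty, \Psi)} \tag{3.11}$$

where

$$B(\theta, \Psi) = \frac{\partial}{\partial\theta}\,E\,\Psi(x, \theta), \tag{3.12}$$

and thus the IF is proportional to the $\psi$-function $\Psi(x_0, \hat\theta_\infty)$.
If $\Psi$ is differentiable with respect to $\theta$, and the conditions that allow the
interchange of derivative and expectation hold, then

$$B(\theta, \Psi) = E\,\dot\Psi(x, \theta) \tag{3.13}$$

where

$$\dot\Psi(x, \theta) = \frac{\partial\Psi(x, \theta)}{\partial\theta}. \tag{3.14}$$

The proof is given in Section 3.8.1. Then, if $\hat\theta$ is consistent for the parametric family
$F_\theta$, (3.11) becomes

$$\mathrm{IF}_{\hat\theta}(x_0, F_\theta) = -\frac{\Psi(x_0, \theta)}{E_{F_\theta}\,\dot\Psi(x, \theta)}.$$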


Consider now an M-estimator $\hat\mu$ of location, with known dispersion $\sigma$, where the
asymptotic value $\hat\mu_\infty$ satisfies

$$E_F\,\psi\!\left(\frac{x - \hat\mu_\infty}{\sigma}\right) = 0.$$

It is easy to show, by applying (3.7) to the estimator defined by the function
$\psi^*(x) = \psi(x/\sigma)$, that the IF of $\hat\mu$ is

$$\mathrm{IF}_{\hat\mu}(x_0, F) = \sigma\,\frac{\psi((x_0 - \hat\mu_\infty)/\sigma)}{E_F\,\psi'((x - \hat\mu_\infty)/\sigma)}. \tag{3.15}$$

Now consider location estimation with a previously computed dispersion estimator
$\hat\sigma$ as in (2.66). In this case, the IF is much more complicated than the one
above, and depends on the IF of $\hat\sigma$. But it can be proved that if $F$ is symmetric, the IF
simplifies to (3.15):

$$\mathrm{IF}_{\hat\mu}(x_0, F) = \hat\sigma_\infty\,\frac{\psi((x_0 - \hat\mu_\infty)/\hat\sigma_\infty)}{E_F\,\psi'((x - \hat\mu_\infty)/\hat\sigma_\infty)}. \tag{3.16}$$

The IF for simultaneous estimation of $\mu$ and $\sigma$ is more complicated, but can be derived
from (3.48) in Section 3.6.

It can be shown that the IF of an $\alpha$-trimmed mean $\hat\mu$ at a symmetric $F$ is
proportional to Huber's $\psi$-function:

$$\mathrm{IF}_{\hat\mu}(x_0, F) = \frac{\psi_k(x_0 - \hat\mu_\infty)}{1 - 2\alpha} \tag{3.17}$$

with $k = F^{-1}(1 - \alpha)$. Hence the trimmed mean and the Huber estimator in the
example at the beginning of the chapter not only have the same asymptotic variances,
but also the same IF. However, Table 3.1 shows that they have very different
degrees of robustness.

Comparing (3.7) to (2.24) and (3.17) to (2.41), one sees that the asymptotic
variance $v$ of these M-estimators satisfies

$$v = E_F\,\mathrm{IF}(x, F)^2. \tag{3.18}$$

It is shown in Section 3.7 that (3.18) holds for a general class of estimators called
Fréchet-differentiable estimators, which includes M-estimators with bounded $\Psi$.
However, the relationship (3.18) does not hold in general. For instance, the Shorth
location estimator (the midpoint of the shortest half of the data; see Problem 2.16a)
has a null IF (Problem 3.12). At the same time, its rate of consistency is $n^{-1/3}$
rather than the usual rate $n^{-1/2}$. Hence the left-hand side of (3.18) is infinite and the
right-hand side is zero.
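
For a location M-estimator the identity (3.18) can be verified in one line. The following worked step is a sketch under the assumption that the IF has the form (3.7); the expectation in the denominator is a constant and can be taken outside the outer expectation.

```latex
\begin{aligned}
E_F\,\mathrm{IF}_{\hat\mu}(x, F)^2
  &= E_F\left[\frac{\psi(x - \hat\mu_\infty)}{E_F\,\psi'(x - \hat\mu_\infty)}\right]^2 \\
  &= \frac{E_F\,\psi(x - \hat\mu_\infty)^2}{\left(E_F\,\psi'(x - \hat\mu_\infty)\right)^2},
\end{aligned}
```

which is the ratio form of the asymptotic variance of a location M-estimator that the text compares with (3.7).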


3.1.1 *The convergence of the SC to the IF 


The plot in the upper left panel of Figure 3.1 for the Huber estimator using the SD as
the previously computed dispersion estimator seems to contradict the convergence of
$\mathrm{SC}_n(x_0)$ to $\mathrm{IF}(x_0)$. Note, however, that (3.6) asserts only the convergence for each $x_0$.
This means that $\mathrm{SC}_n(x_0)$ will be near $\mathrm{IF}(x_0)$ for a given $x_0$ when $n$ is sufficiently large,
but the value of $n$ will in general depend on $x_0$; in other words, the convergence will
not be uniform. Rather than convergence at an isolated point, what matters is being
able to compare the influence of outliers at different locations; that is, the behavior
of the whole curve corresponding to the SC. Both curves will be similar along their
whole range only if the convergence is uniform. This does not happen with H(SD).

On the other hand, Croux (1998) has shown that when $\hat\theta$ is the median, the
distribution of $\mathrm{SC}_n(x_0)$ does not converge in probability to any value, and hence (3.6) does
not hold. This would seem to contradict the upper right panel of Figure 3.1. However,
the form of the curve converges to the correct limit in the sense that for each $x_0$

$$\frac{\mathrm{SC}_n(x_0)}{\max_x |\mathrm{SC}_n(x)|} \to_{\mathrm{a.s.}} \frac{\mathrm{IF}(x_0)}{\max_x |\mathrm{IF}(x)|} = \mathrm{sgn}(x_0 - \mathrm{Med}(x)). \tag{3.19}$$


The proof is left to the reader (Problem 3.2). 


3.2 The breakdown point 


Table 3.1 showed the effect of replacing several data values by outliers. Roughly
speaking, the breakdown point (BP) of an estimator $\hat\theta$ of the parameter $\theta$ is the largest
amount of contamination (proportion of atypical points) that the data may contain
such that $\hat\theta$ still gives some information about $\theta$, that is, about the distribution of the
“typical” points.

Let $\theta$ range over a set $\Theta$. In order for the estimator $\hat\theta$ to give some information
about $\theta$, the contamination should not be able to drive $\hat\theta$ to infinity or to the boundary
of $\Theta$ when it is not empty. For example, for a scale or dispersion parameter, we have
$\Theta = [0, \infty]$, and the estimator should remain bounded, and also bounded away from
0, in the sense that the distance between $\hat\theta$ and 0 should be larger than some positive
value.


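Definition 3.1 The asymptotic contamination BP of the estimator $\hat\theta$ at $F$, denoted
by $\varepsilon^*(\hat\theta, F)$, is the largest $\varepsilon^* \in (0, 1)$ such that for $\varepsilon < \varepsilon^*$, $\hat\theta_\infty((1-\varepsilon)F + \varepsilon G)$ remains
bounded away from the boundary of $\Theta$ for all $G$.


The definition means that there exists a bounded and closed set $K \subset \Theta$ with
$K \cap \partial\Theta = \emptyset$ (where $\partial\Theta$ denotes the boundary of $\Theta$) such that

$$\hat\theta_\infty((1-\varepsilon)F + \varepsilon G) \in K \quad \forall\,\varepsilon < \varepsilon^* \text{ and } \forall\,G. \tag{3.20}$$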


It is helpful to extend the definition to the case when the estimator is not uniquely 
defined, for example when it is the solution of an equation that may have multiple 
roots. In this case, the boundedness of the estimator means that all solutions remain 
in a bounded set. 

The BP for each type of estimator has to be treated separately. Note that it is easy
to find estimators with high BP. For instance, the “estimator” identically equal to zero
has $\varepsilon^* = 1$! However, for “reasonable” estimators it is intuitively clear that there must
be more “typical” than “atypical” points and so $\varepsilon^* \le 1/2$. Actually, it can be proved
(Section 3.8.2) that all shift equivariant location estimators as defined in (2.4) have
$\varepsilon^* \le 1/2$.


3.2.1 Location M-estimators 


It will be convenient first to consider the case of monotonic but not necessarily odd
$\psi$. Assume that

$$k_1 = -\psi(-\infty), \quad k_2 = \psi(\infty)$$

are finite. Then it is shown in Section 3.8.3 that

$$\varepsilon^* = \frac{\min(k_1, k_2)}{k_1 + k_2}. \tag{3.21}$$

It follows that if $\psi$ is odd, then $k_1 = k_2$ and the bound $\varepsilon^* = 0.5$ is attained. Define

$$\varepsilon_j^* = \frac{k_j}{k_1 + k_2} \quad (j = 1, 2). \tag{3.22}$$

Then, (3.21) is equivalent to

$$\varepsilon^* = \min(\varepsilon_1^*, \varepsilon_2^*).$$

The proof of (3.21) shows that $\varepsilon_1^*$ and $\varepsilon_2^*$ are respectively the BPs to $+\infty$ and to
$-\infty$. It can be shown that redescending estimators also attain the bound $\varepsilon^* = 0.5$,
but the proof is more involved since one has to deal, not with (2.19), but with the
minimization (2.13).


3.2.2 Scale and dispersion estimators 


We deal first with scale estimators. Note that while a high proportion of atypical
points with large values (outliers) may cause the estimator $\hat\sigma$ to overestimate the true
scale, a high proportion of data near zero (“inliers”) may result in underestimation
of the true scale. Thus it is desirable that the estimator remains bounded away from
zero (“implosion”) as well as away from infinity (“explosion”). This is equivalent to
keeping the logarithm of $\hat\sigma$ bounded.

Note that a scale M-estimator with $\rho$-function $\rho$ may be written as a location
M-estimator “in the log scale”. Put

$$y = \log|x|, \quad \mu = \log\sigma, \quad \psi(t) = \rho(e^t) - \delta.$$

Since $\rho$ is even and $\rho(0) = 0$, then

$$\rho\!\left(\frac{x}{\sigma}\right) - \delta = \rho\!\left(\frac{|x|}{\sigma}\right) - \delta = \psi(y - \mu),$$

and hence $\hat\sigma = \exp(\hat\mu)$, where $\hat\mu$ verifies $\mathrm{ave}(\psi(y - \hat\mu)) = 0$, and hence $\hat\mu$ is a location
M-estimator.

If $\rho$ is bounded, we have $\rho(\infty) = 1$ by Definition 2.1. Then the BP $\varepsilon^*$ of $\hat\sigma$ is given
by (3.21) with

$$k_1 = \delta, \quad k_2 = 1 - \delta,$$

and so

$$\varepsilon^* = \min(\delta, 1 - \delta). \tag{3.23}$$

Since $\mu \to +\infty$ and $\mu \to -\infty$ are equivalent to $\sigma \to \infty$ and $\sigma \to 0$ respectively, it
follows from (3.22) that $\delta$ and $1 - \delta$ are, respectively, the BPs for explosion and for
implosion.

As for dispersion estimators, it is easy to show that the BPs of the SD, the MAD
and the IQR are 0, 1/2 and 1/4, respectively (Problem 3.3). In general, the BP of an
equivariant dispersion estimator is $\le 0.5$ (Problem 3.5).
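
The formulas (3.21) to (3.23) are simple enough to tabulate directly. The sketch below is illustrative only; the function names are hypothetical, and the inputs are the limits $k_1 = -\psi(-\infty)$ and $k_2 = \psi(\infty)$ of a monotone $\psi$, or the constant $\delta$ of a scale M-estimator with bounded $\rho$.

```python
def breakdown_points(k1, k2):
    """Return (BP to +infinity, BP to -infinity, overall BP) following (3.21)-(3.22)."""
    eps_plus = k1 / (k1 + k2)    # epsilon*_1: breakdown towards +infinity
    eps_minus = k2 / (k1 + k2)   # epsilon*_2: breakdown towards -infinity
    return eps_plus, eps_minus, min(eps_plus, eps_minus)


def scale_breakdown_points(delta):
    """Scale M-estimator with rho(inf) = 1: k1 = delta, k2 = 1 - delta, as in (3.23).

    Returns (explosion BP, implosion BP, overall BP).
    """
    return breakdown_points(delta, 1.0 - delta)


print(breakdown_points(1.0, 1.0))    # odd psi: the bound 0.5 is attained
print(breakdown_points(1.0, 3.0))    # asymmetric psi: breakdown is easier in one direction
print(scale_breakdown_points(0.5))   # balanced protection against explosion and implosion
print(scale_breakdown_points(0.25))  # explosion BP 0.25, implosion BP 0.75
```

Choosing $\delta = 0.5$ is the only way to make the explosion and implosion BPs equal, which is why it maximizes the overall BP in (3.23).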


3.2.3 Location with previously-computed dispersion estimator 


In Table 3.1 we saw the bad consequences of using an M-estimator $\hat\mu$ with the SD as
the previously computed dispersion estimator $\hat\sigma$. The reason is that the outliers inflate
this dispersion estimator, and hence outliers do not appear as such in the “standardized”
residuals $(x_i - \hat\mu)/\hat\sigma$. Hence the robustness of $\hat\sigma$ is essential for that of $\hat\mu$.

For monotone M-estimators with bounded and odd $\psi$, it can be shown that
$\varepsilon^*(\hat\mu) = \varepsilon^*(\hat\sigma)$. Thus, if $\hat\sigma$ is the MAD then $\varepsilon^*(\hat\mu) = 0.5$, but if $\hat\sigma$ is the SD then
$\varepsilon^*(\hat\mu) = 0$.

Note that (3.16) implies that the location estimators using the SD and the MAD as 
previous dispersion have the same IF, while at the same time they have quite different 
BPs. Note that this is an example of an estimator with a bounded IF but a zero BP. 

For redescending M-estimators (2.66) with bounded $\rho$, the situation is more complex.
Consider first the case of a fixed $\sigma$. It can be shown that $\varepsilon^*(\hat\mu)$ can be made
arbitrarily small by taking $\sigma$ small enough. This suggests that for the case of an
estimator $\hat\sigma$, it is not only the BP of $\hat\sigma$ that matters but also the size of $\hat\sigma$. Let $\hat\mu_0$ be an
initial estimator with BP = 0.5 (say the median), and let $\hat\sigma$ be an M-scale centered at
$\hat\mu_0$, as defined by

$$\frac{1}{n}\sum_{i=1}^{n} \rho_0\!\left(\frac{x_i - \hat\mu_0}{\hat\sigma}\right) = 0.5$$

where $\rho_0$ is another bounded $\rho$-function. If $\rho \le \rho_0$, then $\varepsilon^*(\hat\mu) = 0.5$ (a proof is given
in Section 3.8.3).

Since the MAD corresponds to the $\rho$-function $\mathrm{I}(x > 1)$, it does not fulfill $\rho \le \rho_0$. In this case the
situation is more complicated and the BP will in general depend on the distribution
(or on the data in the case of the finite-sample BP introduced below). Huber
(1984) calculated the BP for this situation, and it follows from his results that for the
bisquare $\rho$ with MAD scale, the BP is 1/2 for all practical purposes. Details are given
in Section 3.8.3.2.


3.2.4 Simultaneous estimation


The BP for the estimators in Section 2.7.2 is much more complicated, requiring the
solution of a nonlinear system of equations (Huber and Ronchetti, 2009, p. 141). In
general, the BP of $\hat\mu$ is less than 0.5. In particular, using Huber's $\psi_k$ with $\hat\sigma$ given by
(2.76) yields

$$\varepsilon^* = \min\!\left(0.5,\ \frac{0.675}{k + 0.675}\right),$$

so that with $k = 1.37$ we have $\varepsilon^* = 0.33$. This is clearly lower than the BP = 0.5,
which corresponds to using a previously computed dispersion estimator treated
above.


3.2.5 Finite-sample breakdown point 


Although the asymptotic BP is an important theoretical concept, it may be more useful
to define the notion of BP for a finite sample. Let $\hat\theta_n = \hat\theta_n(\mathbf{x})$ be an estimator
defined for samples $\mathbf{x} = \{x_1,\ldots,x_n\}$. The replacement finite-sample breakdown point
(FBP) of $\hat\theta_n$ at $\mathbf{x}$ is the largest proportion $\varepsilon_n^*(\hat\theta_n, \mathbf{x})$ of data points that can be arbitrarily
replaced by outliers without $\hat\theta_n$ leaving a set which is bounded, and also bounded
away from the boundary of $\Theta$ (Donoho and Huber, 1983). More formally, call $\mathcal{X}_m$ the
set of all datasets $\mathbf{y}$ of size $n$ having $n - m$ elements in common with $\mathbf{x}$:

$$\mathcal{X}_m = \{\mathbf{y} : \#(\mathbf{y}) = n,\ \#(\mathbf{x} \cap \mathbf{y}) = n - m\}.$$

Then

$$\varepsilon_n^*(\hat\theta_n, \mathbf{x}) = \frac{m^*}{n}, \tag{3.24}$$

where

$$m^* = \max\{m \ge 0 : \hat\theta_n(\mathbf{y}) \text{ bounded and also bounded away from } \partial\Theta\ \ \forall\,\mathbf{y} \in \mathcal{X}_m\}. \tag{3.25}$$

In most cases of interest, $\varepsilon_n^*$ does not depend on $\mathbf{x}$, and tends to the asymptotic BP
when $n \to \infty$. For equivariant location estimators, it is proved in Section 3.8.2 that

$$\varepsilon_n^* \le \frac{1}{n}\left[\frac{n-1}{2}\right] \tag{3.26}$$

where $[\cdot]$ denotes the integer part, and that this bound is attained by M-estimators with odd and bounded $\psi$. For the
trimmed mean, it is easy to verify that $m^* = [n\alpha]$, so that $\varepsilon_n^* \approx \alpha$ for large $n$.
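
The replacement FBP can be probed empirically by replacing more and more points with a very large value and watching when the estimate leaves a bounded set. The sketch below is a heuristic illustration, not a proof: it tries only one contamination pattern (all replaced points moved to the same huge value), so in general it gives an upper bound on $m^*$. The helper names, the outlier value and the bound are assumptions made for the example.

```python
from statistics import mean, median


def trimmed_mean(xs, alpha=0.1):
    """alpha-trimmed mean: drop [n * alpha] points from each end, average the rest."""
    g = int(len(xs) * alpha)
    s = sorted(xs)
    return mean(s[g:len(s) - g])


def replacement_m_star(estimator, xs, outlier=1e12, bound=1e6):
    """Largest m such that replacing m points by `outlier` keeps |estimate| < bound."""
    n = len(xs)
    m_star = 0
    for m in range(1, n + 1):
        contaminated = xs[:n - m] + [outlier] * m  # keep n - m original points
        if abs(estimator(contaminated)) >= bound:
            break  # the estimate has been carried away by the outliers
        m_star = m
    return m_star


data = [float(i) for i in range(1, 22)]  # n = 21 well-behaved points
n = len(data)
for name, est in (("mean", mean), ("trimmed mean", trimmed_mean), ("median", median)):
    m_star = replacement_m_star(est, data)
    print(f"{name:13s} m* = {m_star:2d}   m*/n = {m_star / n:.3f}")
```

For this sample the median reaches the bound (3.26), the 10% trimmed mean tolerates $[n\alpha]$ replaced points, and the mean tolerates none.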

Another possibility is the addition FBP. Call $\mathcal{X}_m$ the set of all datasets of size
$n + m$ containing $\mathbf{x}$:

$$\mathcal{X}_m = \{\mathbf{y} : \#(\mathbf{y}) = n + m,\ \mathbf{x} \subset \mathbf{y}\}.$$

Then

$$\varepsilon_n^{**}(\hat\theta_n, \mathbf{x}) = \frac{m^*}{n + m^*},$$

where

$$m^* = \max\{m \ge 0 : \hat\theta_{n+m}(\mathbf{y}) \text{ bounded and also bounded away from } \partial\Theta\ \ \forall\,\mathbf{y} \in \mathcal{X}_m\}.$$

Both $\varepsilon^*$ and $\varepsilon^{**}$ give similar values for large $n$, but we prefer the former. The main
reason for this is that the definition involves only the estimator for the given $n$, which
makes it easier to generalize this concept to more complex cases, as will be seen in
Section 4.6.


3.3 Maximum asymptotic bias


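The IF and the BP consider extreme situations of contamination. The first deals with
“infinitesimal” values of $\varepsilon$, while the second deals with the largest $\varepsilon$ an estimator can
tolerate. Note that an estimator having a high BP means that $\hat\theta_\infty(F)$ will remain in a
bounded set when $F$ ranges in an $\varepsilon$-neighborhood (3.3) with $\varepsilon < \varepsilon^*$, but this set may
be very large. What we want to do now is, roughly speaking, to measure the worst
behavior of the estimator for each given $\varepsilon < \varepsilon^*$.

We again consider $F$ ranging in the $\varepsilon$-neighborhood

$$\mathcal{F}_{\varepsilon,\theta} = \{(1-\varepsilon)F_\theta + \varepsilon G : G \in \mathcal{G}\}$$

of an assumed parametric distribution $F_\theta$, where $\mathcal{G}$ is a family of distribution
functions. Unless otherwise specified, $\mathcal{G}$ will be the family of all distribution functions,
but in some cases it will be more convenient to choose a more restricted family such
as that of point-mass distributions. The asymptotic bias of $\hat\theta$ at any $F \in \mathcal{F}_{\varepsilon,\theta}$ is

$$b_{\hat\theta}(F, \theta) = \hat\theta_\infty(F) - \theta$$

and the maximum bias (MB) is

$$\mathrm{MB}_{\hat\theta}(\varepsilon, \theta) = \max\{|b_{\hat\theta}(F, \theta)| : F \in \mathcal{F}_{\varepsilon,\theta}\}.$$

In the case that the parameter space is the whole set of real numbers, the
relationship between MB and BP is

$$\varepsilon^*(\hat\theta, F_\theta) = \sup\{\varepsilon \ge 0 : \mathrm{MB}_{\hat\theta}(\varepsilon, \theta) < \infty\}.$$

Note that two estimators may have the same BP but different MBs (Problem 3.11).
The contamination sensitivity of $\hat\theta$ at $\theta$ is defined as

$$\gamma_c(\hat\theta, \theta) = \left[\frac{d}{d\varepsilon}\,\mathrm{MB}_{\hat\theta}(\varepsilon, \theta)\right]_{\varepsilon = 0}. \tag{3.27}$$

In the case that $\hat\theta$ is consistent, we have $\hat\theta_\infty(F_\theta) = \theta$ and then $\mathrm{MB}_{\hat\theta}(0, \theta) =
b_{\hat\theta}(F_\theta, \theta) = 0$. Therefore $\gamma_c$ gives an approximation to the MB for small $\varepsilon$:

$$\mathrm{MB}_{\hat\theta}(\varepsilon, \theta) \approx \varepsilon\,\gamma_c(\hat\theta, \theta). \tag{3.28}$$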


Figure 3.3 Maximum bias of Huber estimator (solid line) and its linear approximation (dotted line)
as a function of $\varepsilon$ (vertical axis: bias; horizontal axis: fraction of outliers)


Note, however, that since $\mathrm{MB}_{\hat\theta}(\varepsilon^*, \theta) = \infty$, while the right-hand side of (3.28) always
yields a finite result, this approximation will be quite unreliable for sufficiently large
values of $\varepsilon$. Figure 3.3 shows $\mathrm{MB}_{\hat\theta}(\varepsilon, \theta)$ at $F_\theta = \mathrm{N}(\theta, 1)$ and its approximation (3.28)
for the Huber location estimator with $k = 1.37$ (note that the bias does not depend on
$\theta$ due to the estimator's shift equivariance).
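
The gap between the maximum bias and its linear approximation can be explored numerically. The following is a sketch under explicit assumptions that are not stated in this section: the dispersion is known and equal to 1, the model is $\mathrm{N}(0,1)$, and the worst contamination for a monotone, odd $\psi$ is a point mass pushed to $+\infty$, so that the maximum bias $b$ solves $(1-\varepsilon)E\,\psi_k(x - b) + \varepsilon k = 0$. The slope of the linear approximation is taken to be $k / E\,\psi_k'(x)$. Function names are illustrative.

```python
from statistics import NormalDist

ND = NormalDist()


def mean_huber_psi(m, k):
    """E psi_k(u) for u ~ N(m, 1), where psi_k clips its argument to [-k, k]."""
    lower, upper = ND.cdf(-k - m), ND.cdf(k - m)
    return (-k * lower + k * (1 - upper)
            + m * (upper - lower) + ND.pdf(-k - m) - ND.pdf(k - m))


def max_bias(eps, k=1.37, upper_limit=50.0):
    """Root b of (1 - eps) * E psi_k(x - b) + eps * k = 0 with x ~ N(0, 1), eps < 0.5."""
    lo, hi = 0.0, upper_limit
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (1 - eps) * mean_huber_psi(-mid, k) + eps * k > 0:
            lo = mid  # bias not yet large enough to balance the outliers
        else:
            hi = mid
    return 0.5 * (lo + hi)


def linear_approximation(eps, k=1.37):
    """eps * k / E psi_k'(x), with E psi_k'(x) = P(|x| < k) under N(0, 1)."""
    return eps * k / (2 * ND.cdf(k) - 1)


for eps in (0.05, 0.10, 0.20, 0.40, 0.49):
    print(f"eps={eps:4.2f}  MB={max_bias(eps):7.3f}  linear={linear_approximation(eps):6.3f}")
```

Under these assumptions the two quantities are close for small $\varepsilon$ and diverge as $\varepsilon$ approaches 0.5, where the root of the bias equation moves off to infinity.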

The gross-error sensitivity (GES) of $\hat\theta$ at $\theta$ is

$$\gamma^*(\hat\theta, \theta) = \max_{x_0} |\mathrm{IF}_{\hat\theta}(x_0, F_\theta)|. \tag{3.29}$$

Since $(1-\varepsilon)F_\theta + \varepsilon\delta_{x_0} \in \mathcal{F}_{\varepsilon,\theta}$, we have for all $x_0$

$$|\hat\theta_\infty((1-\varepsilon)F_\theta + \varepsilon\delta_{x_0}) - \hat\theta_\infty(F_\theta)| \le \mathrm{MB}_{\hat\theta}(\varepsilon, \theta).$$

So dividing by $\varepsilon$ and taking the limit we get

$$\gamma^* \le \gamma_c. \tag{3.30}$$

Equality above holds for M-estimators with bounded $\psi$-functions, but not in general.
For instance, we have seen in Section 3.2.3 that the IF of the Huber estimator with
the SD as previous dispersion is bounded, but since $\varepsilon^* = 0$ we have $\mathrm{MB}_{\hat\theta}(\varepsilon, \theta) = \infty$
for all $\varepsilon > 0$ and so the right-hand side of (3.30) is infinite.
For location M-estimators $\hat\mu$ with odd $\psi$ and $k = \psi(\infty)$, and assuming a location
model $F_\mu(x) = F_0(x - \mu)$, we have

$$\gamma^*(\hat\mu, \mu) = \frac{k}{E_{F_\mu}\psi'(x - \mu)} = \frac{k}{E_{F_0}\psi'(x)} \tag{3.31}$$

so that $\gamma^*(\hat\mu, \mu)$ does not depend on $\mu$.

In general for equivariant estimators, $\mathrm{MB}_{\hat\theta}(\varepsilon, \theta)$ does not depend on $\theta$. In
particular, the MB for a bounded location M-estimator is as given in Section 3.8.4,
where it is shown that the median minimizes the MB for M-estimators in symmetric
models.


3.4 Balancing robustness and efficiency 


In this section we consider a parametric model $F_\theta$ and an estimator $\hat\theta_n$ which is
consistent for $\theta$ and such that the distribution of $\sqrt{n}(\hat\theta_n - \theta)$ under $F_\theta$ tends to a normal
distribution with mean 0 and variance $v = v(\hat\theta, \theta)$. This is the most frequent case and
contains most of the situations considered in this book.

Under the preceding assumptions, $\hat\theta$ has no asymptotic bias and we care only
about its variability. Let $v_{\min} = v_{\min}(\theta)$ be the smallest possible asymptotic variance
within a “reasonable” class of estimators (for example, equivariant). Under reasonable
regularity conditions, $v_{\min}$ is the asymptotic variance of the MLE for the model
(Section 10.8). Then the asymptotic efficiency of $\hat\theta$ at $\theta$ is defined as $v_{\min}(\theta)/v(\hat\theta, \theta)$.

If instead $F$ does not belong to the family $F_\theta$ but is in a neighborhood of $F_\theta$, the
squared bias will dominate the variance component of MSE for all sufficiently large
$n$. To see this, let $b = \hat\theta_\infty(F) - \theta$ and note that in general under $F$ the distribution of
$\sqrt{n}(\hat\theta_n - \hat\theta_\infty)$ tends to normal, with mean 0 and variance $v$. Then the distribution of
$\hat\theta_n - \theta$ is approximately $\mathrm{N}(b, v/n)$, so that the variance tends to zero while the bias
does not. Thus we must balance the efficiency of $\hat\theta$ at the model $F_\theta$ with the bias in a
neighborhood of it.

We have seen that location M-estimators with a bounded $\psi$ and previously computed
dispersion estimator with BP = 1/2 attain the maximum BP of 1/2. To choose
among them we must compare their biases for a given efficiency. We consider the
Huber and bisquare estimators with previously computed MAD dispersion and
efficiency 0.95. Their maximum biases for the model $\mathcal{F}_{\varepsilon,\theta} = \{(1-\varepsilon)F_\theta + \varepsilon G : G \in \mathcal{G}\}$
with $F_\theta = \mathrm{N}(0, 1)$ and a few values of $\varepsilon$ are shown in Table 3.2.

Table 3.2 Maximum contamination biases of
Huber and bisquare location estimators

ε           0.05    0.10    0.20
Huber       0.087   0.184   0.419
Bisquare    0.093   0.197   0.450


Figure 3.4 shows the respective biases for point contamination at $K$ with $\varepsilon = 0.1$,
as a function of the outlier location $K$. It is seen that although the maximum bias of
the bisquare is higher, the difference is very small and its bias remains below that
of the Huber estimator for the majority of the values. This shows that, although the
maximum bias contains much more information than the BP, it is not informative
enough to discriminate among estimators and that one should look at the whole bias
behavior when possible.


Figure 3.4 Asymptotic biases of Huber and bisquare estimators for 10% contamination
as functions of the outlier location K


Table 3.3 Asymptotic efficiencies of three
location M-estimators

            Huber   Bisq.   CMLE
Normal      0.95    0.95    0.60
Cauchy      0.57    0.72    1.00

To study the behavior of the estimators under symmetric heavy-tailed distribu- 
tions, we computed the asymptotic variances of the Huber and bisquare estimators, 
and of the Cauchy MLE (“CMLE”), with simultaneous dispersion (Section 2.7.2) for 
the normal and Cauchy distributions, the latter of which can be considered an extreme 
case of heavy-tailed behavior. The efficiencies are given in Table 3.3. It is seen that 
the bisquare estimator yields the best trade-off between the efficiencies for the two 
distributions. 

For all the above reasons, we recommend the bisquare M-estimator with previously
computed MAD when estimating location.


3.5 *“Optimal” robustness


In this section we consider different ways in which an “optimal” estimator may be 
defined. 


3.5.1 Bias- and variance-optimality of location estimators 
3.5.1.1 Minimax bias 


If we pay attention only to bias, the quest for an “optimal” location estimator is
simple: Huber (1964) has shown that the median has the smallest maximum bias
(“minimax bias”) among all shift equivariant estimators if the underlying distribution
is symmetric and unimodal. See Section 3.8.5 for a proof.


3.5.1.2 Minimax variance 


Huber (1964) studied location M-estimators in neighborhoods (3.3) of a symmetric
$F$ with symmetric contamination (so that there is no bias problem). The dispersion
is assumed known. Call $v(\hat\theta, H)$ the asymptotic variance of the estimator $\hat\theta$ at the
distribution $H$, and

$$v_\varepsilon(\hat\theta) = \sup_{H \in \mathcal{F}(F, \varepsilon)} v(\hat\theta, H),$$

where $\mathcal{F}(F, \varepsilon)$ is the neighborhood (3.3) with $G$ ranging over all symmetric
distributions. Assume that $F$ has a density $f$ and that $\psi_0 = -f'/f$ is nondecreasing. Then the
M-estimator minimizing $v_\varepsilon(\hat\theta)$ has

$$\psi(x) = \begin{cases} \psi_0(x) & \text{if } |\psi_0(x)| \le k \\ k\,\mathrm{sgn}(x) & \text{else} \end{cases}$$

where $k$ depends on $F$ and $\varepsilon$. For normal $F$, this is the Huber $\psi_k$. Since $\psi_0$ corresponds
to the MLE for $f$, the result may be described as a truncated MLE.

The same problem with unknown dispersion was considered by Li and Zamar 
(1991). 


3.5.2 Bias optimality of scale and dispersion estimators 


The problem of minimax bias scale estimation for positive random variables was
considered by Martin and Zamar (1989), who showed that for the case of a nominal
exponential distribution the scaled median Med(x)/0.693 was an excellent approximation to
the minimax bias optimal estimator for a wide range of $\varepsilon < 0.5$ (as we will see in
Problem 3.15, this estimator also minimizes the GES). Minimax bias dispersion
estimators were treated by Martin and Zamar (1993b) for the case of a nominal
normal distribution and two separate families of estimators:

• For simultaneous estimation of location and scale/dispersion with the monotone
$\psi$-function, the minimax bias estimator is well approximated by the MAD for
all $\varepsilon < 0.5$, thereby providing a theoretical rationale for an otherwise well-known
high-BP estimator.

• For M-estimators of scale with a general location estimator that includes location
M-estimators with redescending $\psi$-functions, the minimax bias estimator is well
approximated by the Shorth dispersion estimator (the shortest half of the data, see
Problem 2.16b) for a wide range of $\varepsilon < 0.5$. This is an intuitively appealing
estimator with BP = 1/2.


3.5.3 The infinitesimal approach


Several criteria have been proposed to define an optimal balance between bias and
variance. The treatment can be simplified if $\varepsilon$ is assumed to be “very small”. Then
the maximum bias can be approximated through the gross-error sensitivity (GES)
(3.29). We first treat the simpler problem of minimizing the GES. Let $F_\theta$ be a
parametric family with densities or frequency functions $f_\theta(x)$. Call $E_\theta$ the expectation with
respect to $F_\theta$; that is, if the random variable $z \sim F_\theta$ and $h$ is any function,

$$E_\theta\,h(z) = \begin{cases} \int h(x)\,f_\theta(x)\,dx & (z \text{ continuous}) \\ \sum_x h(x)\,f_\theta(x) & (z \text{ discrete}). \end{cases}$$


We shall deal with general M-estimators $\hat\theta_n$ defined by (3.9), where $\Psi$ is usually
called the score function. An M-estimator is called Fisher-consistent for the family
$F_\theta$ if:

$$E_\theta\,\Psi(x, \theta) = 0. \tag{3.32}$$

In view of (3.10), a Fisher-consistent M-estimator is consistent in the sense of (3.2).
It is shown in Section 10.3 that if $\hat\theta_n$ is Fisher-consistent, then

$$n^{1/2}(\hat\theta_n - \theta) \to_d \mathrm{N}(0, v(\Psi, \theta)),$$

with

$$v(\Psi, \theta) = \frac{A(\theta, \Psi)}{B(\theta, \Psi)^2},$$

where $B$ is defined in (3.12) and

$$A(\theta, \Psi) = E_\theta(\Psi(x, \theta)^2). \tag{3.33}$$

It follows from (3.11) that the GES of an M-estimator is

$$\gamma^*(\hat\theta, \theta) = \frac{\max_x |\Psi(x, \theta)|}{|B(\theta, \Psi)|}.$$

The MLE is the M-estimator with score function

$$\Psi_0(x, \theta) = -\frac{\dot f_\theta(x)}{f_\theta(x)}, \quad \text{with } \dot f_\theta(x) = \frac{\partial f_\theta(x)}{\partial\theta}. \tag{3.34}$$

It is shown in Section 10.8 that this estimator is Fisher-consistent; that is,

$$E_\theta\,\Psi_0(x, \theta) = 0, \tag{3.35}$$

and has the minimum asymptotic variance among Fisher-consistent M-estimators.
We now consider the problem of minimizing $\gamma^*$ among M-estimators. To ensure
that the estimates consider the correct parameter, we consider only Fisher-consistent
estimators.
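Call $\mathrm{Med}_\theta$ the median under $F_\theta$; that is, if $z \sim F_\theta$ and $h$ is any function, then
$\mathrm{Med}_\theta(h(z))$ is the value $t$ where

$$\int \mathrm{I}\{h(x) \le t\}\,f_\theta(x)\,dx - 0.5$$

changes sign.
Define

$$M(\theta) = \mathrm{Med}_\theta\,\Psi_0(x, \theta).$$

It is shown in Section 3.8.6 that the M-estimator $\tilde\theta$ with score function

$$\tilde\Psi(x, \theta) = \mathrm{sgn}(\Psi_0(x, \theta) - M(\theta)) \tag{3.36}$$

is Fisher-consistent and is the M-estimator with smallest $\gamma^*$ in that class.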

This estimator has a clear intuitive interpretation. Recall that the median is a
location M-estimator with $\psi$-function equal to the sign function. Likewise, $\tilde\theta$ is the
solution $\theta$ of

$$\mathrm{Med}\{\Psi_0(x_1, \theta), \ldots, \Psi_0(x_n, \theta)\} = \mathrm{Med}_\theta\,\Psi_0(x, \theta). \tag{3.37}$$

Note that, in view of (3.35), the MLE may be written as the solution of

$$\frac{1}{n}\sum_{i=1}^{n} \Psi_0(x_i, \theta) = E_\theta\,\Psi_0(x, \theta). \tag{3.38}$$

Hence (3.37) can be seen as a version of (3.38), in which the average on the left-hand
side is replaced by the sample median, and the expectation on the right is replaced by
the distribution median.
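
As an illustration of (3.37), consider the exponential scale family, which is the nominal model mentioned in Section 3.5.2. The derivation below is a sketch; the parametrization $f_\theta(x) = \theta^{-1}e^{-x/\theta}$ is an assumption made for the example, and the sign convention chosen for $\Psi_0$ does not affect the result.

```latex
\begin{aligned}
f_\theta(x) &= \theta^{-1} e^{-x/\theta}, \quad x \ge 0, \\
\Psi_0(x, \theta) &= -\frac{\partial}{\partial\theta}\log f_\theta(x)
                   = \frac{1}{\theta} - \frac{x}{\theta^2}
   \quad \text{(affine and strictly decreasing in } x\text{)}, \\
\mathrm{Med}\{\Psi_0(x_1, \theta), \ldots, \Psi_0(x_n, \theta)\}
  &= \Psi_0(\mathrm{Med}(x_1, \ldots, x_n), \theta), \\
\mathrm{Med}_\theta\,\Psi_0(x, \theta)
  &= \Psi_0(\theta \log 2, \theta)
   \quad \text{since } \mathrm{Med}_\theta(x) = \theta \log 2, \\
\text{so (3.37) reduces to } \mathrm{Med}(x_1, \ldots, x_n) &= \theta \log 2,
  \qquad \tilde\theta = \frac{\mathrm{Med}(x_1, \ldots, x_n)}{\log 2}
  \approx \frac{\mathrm{Med}(x_1, \ldots, x_n)}{0.693}.
\end{aligned}
```

This is the scaled median $\mathrm{Med}(x)/0.693$ that Section 3.5.2 describes as minimizing the GES.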


3.5.4 The Hampel approach 


Hampel (1974) stated the balance problem between bias and efficiency for general
estimators as minimizing the asymptotic variance under a bound on the GES. For a
symmetric location model, his result coincides with Huber's. It is remarkable that
both approaches coincide at the location problem, and furthermore the result has a
high BP.
To simplify the notation, in this section we will write $\gamma^*(\Psi, \theta)$ for the GES
$\gamma^*(\hat\theta, \theta)$ of an M-estimator $\hat\theta$ with score function $\Psi$. Hampel proposed to choose an
M-estimator combining efficiency and robustness by finding $\Psi$ such that, subject
to (3.32),

$$v(\Psi, \theta) = \min \quad \text{with } \gamma^*(\Psi, \theta) \le G(\theta), \tag{3.39}$$

where $G(\theta)$ is a given bound expressing the desired degree of robustness. It is clear
that a higher robustness means a lower $G(\theta)$, but that this implies a higher $v(\Psi, \theta)$.
We call this optimization problem Hampel's direct problem.
We can also consider a dual Hampel problem, in which we look for a function $\Psi$
such that

$$\gamma^*(\Psi, \theta) = \min \quad \text{with } v(\Psi, \theta) \le V(\theta), \tag{3.40}$$

with given $V$. It is easy to see that both problems are equivalent in the following sense:
if $\Psi^*$ is optimal for the direct Hampel problem, then it is also optimal for the dual
problem with $V(\theta) = v(\Psi^*, \theta)$. Similarly if $\Psi^*$ is optimal for the dual problem, it is
also optimal for the direct problem with $G(\theta) = \gamma^*(\Psi^*, \theta)$.

The solution to the direct and dual problems was given by Hampel (1974). The
optimal score functions for both problems are of the following form:

$$\Psi^*(x, \theta) = \psi_{k(\theta)}(\Psi_0(x, \theta) - r(\theta)) \tag{3.41}$$

where $\Psi_0$ is given by (3.34), $\psi_k$ is Huber's $\psi$-function (2.29), and $r(\theta)$ and $k(\theta)$ are
chosen so that $\Psi^*$ satisfies (3.32). A proof is given in Section 3.8.7. It is seen that
the optimal score function is obtained from $\Psi_0$ by first centering through $r$ and then
bounding its absolute value by $k$. Note that (3.36) is the limit case of (3.41) when
$k \to 0$. Note also that for a solution to exist, $G(\theta)$ must be larger than the minimum
GES $\gamma^*(\tilde\Psi, \theta)$, and $V(\theta)$ must be larger than the asymptotic variance of the MLE:
$v(\Psi_0, \theta)$.

It is not clear what a practical rule for the choice of $G(\theta)$ in the direct
Hampel problem would be. But for the second problem a reasonable criterion is to choose $V(\theta)$
as

$$V(\theta) = \frac{v(\Psi_0, \theta)}{1 - \alpha}, \tag{3.42}$$

where $1 - \alpha$ is the desired asymptotic efficiency of the estimator with respect to
the MLE.

Finding $k$ for a given $V$ or $G$ may be complicated. The problem simplifies
considerably when $F_\theta$ is a location or a scale family, for in these cases the MLE is location
(or scale) equivariant. We shall henceforth deal with bounds (3.42). We shall see that
$k$ may be chosen as a constant, which can then be found numerically.

For the location model we know from (2.19) that

$$\Psi_0(x, \theta) = \xi_0(x - \theta) \quad \text{with } \xi_0(x) = -\frac{f_0'(x)}{f_0(x)}. \tag{3.43}$$

Hence $v(\Psi_0, \theta)$ does not depend on $\theta$, and

$$\Psi^*(x, \theta) = \psi_{k(\theta)}(\xi_0(x - \theta) - r(\theta)). \tag{3.44}$$

If $k(\theta)$ is constant, then the $r(\theta)$ that fulfills (3.32) is constant too, which implies
that $\Psi^*(x, \theta)$ depends only on $x - \theta$, and hence the estimator is location equivariant.
This implies that $v(\Psi^*, \theta)$ does not depend on $\theta$ either, and depends only on $k$, which
can be found numerically to attain equality in (3.40).

In particular, if $f_0$ is symmetric, it is easy to show that $r = 0$. When $f_0 = \mathrm{N}(0, 1)$
we obtain the Huber score function.
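
For the normal location model the numerical search for $k$ is short. The sketch below assumes $f_0 = \mathrm{N}(0,1)$ and $r = 0$, so that $\Psi^*$ is the Huber $\psi_k$, the MLE has asymptotic variance 1, and the bound (3.42) becomes $V = 1/(1-\alpha)$. The closed form used for the variance and the function names are assumptions made for the example.

```python
from statistics import NormalDist

ND = NormalDist()


def huber_variance(k):
    """v(Psi*, theta) = E psi_k(x)^2 / (E psi_k'(x))^2 at f_0 = N(0, 1)."""
    inside = 2 * ND.cdf(k) - 1                   # E psi_k'(x) = P(|x| < k)
    second_moment = inside - 2 * k * ND.pdf(k)   # integral of x^2 phi(x) over [-k, k]
    return (second_moment + k * k * (1 - inside)) / inside ** 2


def tuning_constant(efficiency, lo=0.1, hi=5.0):
    """Bisection for k such that v(k) equals V = 1 / efficiency, as in (3.42)."""
    target = 1.0 / efficiency
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if huber_variance(mid) > target:
            lo = mid  # variance still above the bound: allow a larger k
        else:
            hi = mid
    return 0.5 * (lo + hi)


for eff in (0.85, 0.90, 0.95):
    k = tuning_constant(eff)
    print(f"efficiency={eff:.2f}  k={k:.3f}  check={1 / huber_variance(k):.4f}")
```

The bisection relies on the variance decreasing as $k$ grows, which holds for the Huber function at the normal model: a larger $k$ moves the estimator towards the mean.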

For a scale model it follows from (2.48) that

$$\Psi_0(x, \theta) = \frac{x}{\theta}\,\xi_0\!\left(\frac{x}{\theta}\right) - 1,$$

with $\xi_0$ as in (3.43). It follows that $v(\Psi_0, \theta)$ is proportional to $\theta^2$, and that $\Psi^*$ has the
form

$$\Psi^*(x, \theta) = \psi_k\!\left(\frac{x}{\theta}\,\xi_0\!\left(\frac{x}{\theta}\right) - r(\theta)\right). \tag{3.45}$$

If $k$ is constant, then the $r(\theta)$ that fulfills (3.32) is proportional to $\theta$, which implies
that $\Psi^*(x, \theta)$ depends only on $x/\theta$, and hence the estimator is scale equivariant. This
implies that $v(\Psi^*, \theta)$ is also proportional to $\theta^2$, and hence a constant $k$ can be found
numerically to attain equality in (3.40).

The case of the exponential family is left for the reader in Problem 3.15.
Extensions of this approach when there is more than one parameter may be found in Hampel
et al. (1986).


3.5.5 Balancing bias and variance: the general problem 


More realistic results are obtained by working with a positive (not “infinitesimal”) $\varepsilon$.
Martin and Zamar (1993a) found the location estimator minimizing the asymptotic
variance under a given bound on the maximum asymptotic bias for a given $\varepsilon > 0$.
Fraiman et al. (2001) derived the location estimators minimizing the MSE of a given
function of the parameters in an $\varepsilon$-contamination neighborhood. This allowed them to
derive “optimal” confidence intervals that retain the asymptotic coverage probability
in a neighborhood.


3.6 Multidimensional parameters 


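We now consider the estimation of $p$ parameters $\theta_1,\ldots,\theta_p$ (e.g., location and
dispersion), represented by the vector $\boldsymbol\theta = (\theta_1,\ldots,\theta_p)'$. Let $\hat{\boldsymbol\theta}_n$ be an estimator with
asymptotic value $\hat{\boldsymbol\theta}_\infty$. Then the asymptotic bias is defined as

$$b_{\hat{\boldsymbol\theta}}(F, \boldsymbol\theta) = \mathrm{disc}(\hat{\boldsymbol\theta}_\infty(F), \boldsymbol\theta),$$

where $\mathrm{disc}(\mathbf{a}, \mathbf{b})$ is a measure of the discrepancy between the vectors $\mathbf{a}$ and $\mathbf{b}$, which
depends on the particular situation. In many cases one may take the Euclidean
distance $\|\mathbf{a} - \mathbf{b}\|$, but in other cases it may be more complex (as in Section 6.7).

We now consider the efficiency. Assume $\hat{\boldsymbol\theta}_n$ is asymptotically normal with covariance
matrix $\mathbf{V}$. Let $\tilde{\boldsymbol\theta}_n$ be the MLE, with asymptotic covariance matrix $\mathbf{V}_0$. For $\mathbf{c} \in R^p$
the asymptotic variances of linear combinations $\mathbf{c}'\hat{\boldsymbol\theta}_n$ and $\mathbf{c}'\tilde{\boldsymbol\theta}_n$ are respectively $\mathbf{c}'\mathbf{V}\mathbf{c}$
and $\mathbf{c}'\mathbf{V}_0\mathbf{c}$, and their ratio would yield an efficiency measure for each $\mathbf{c}$. To express
them through a single number, we take the worst situation, and define the asymptotic
efficiency of $\hat{\boldsymbol\theta}_n$ as

$$\mathrm{eff}(\hat{\boldsymbol\theta}_n) = \min_{\mathbf{c} \ne \mathbf{0}} \frac{\mathbf{c}'\mathbf{V}_0\mathbf{c}}{\mathbf{c}'\mathbf{V}\mathbf{c}}.$$

It is easy to show that

$$\mathrm{eff}(\hat{\boldsymbol\theta}_n) = \lambda_1(\mathbf{V}^{-1}\mathbf{V}_0), \tag{3.46}$$

where $\lambda_1(\mathbf{M})$ denotes the smallest eigenvalue of the matrix $\mathbf{M}$.

In many situations (as in Section 4.4) $\mathbf{V} = a\mathbf{V}_0$, where $a$ is a constant, and then
the efficiency is simply $1/a$.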
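
The equality between the minimum variance ratio and the smallest eigenvalue in (3.46) can be checked numerically in two dimensions. The sketch below uses only the standard library; the two covariance matrices are made-up illustrative values, and the eigenvalue is obtained from the closed-form solution of the 2 × 2 characteristic equation.

```python
import math


def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def product_2x2(a, b):
    """Matrix product of two 2 x 2 matrices."""
    return [[sum(a[i][r] * b[r][j] for r in range(2)) for j in range(2)] for i in range(2)]


def efficiency(v, v0):
    """Smallest eigenvalue of V^{-1} V0, as in (3.46), for 2 x 2 matrices."""
    m = product_2x2(inverse_2x2(v), v0)
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = max(trace * trace - 4 * det, 0.0)  # guard against rounding below zero
    return (trace - math.sqrt(disc)) / 2


def efficiency_by_search(v, v0, steps=20000):
    """Brute-force minimum of c'V0c / c'Vc over unit vectors c = (cos t, sin t)."""
    def quad(a, c):
        return sum(c[i] * a[i][j] * c[j] for i in range(2) for j in range(2))
    best = float("inf")
    for i in range(steps):
        t = math.pi * i / steps
        c = (math.cos(t), math.sin(t))
        best = min(best, quad(v0, c) / quad(v, c))
    return best


V0 = [[1.0, 0.0], [0.0, 0.5]]   # hypothetical covariance matrix of the MLE
V = [[1.2, 0.1], [0.1, 0.9]]    # hypothetical covariance matrix of the estimator
print(efficiency(V, V0), efficiency_by_search(V, V0))

a = 1.25
V_scaled = [[a * x for x in row] for row in V0]
print(efficiency(V_scaled, V0))  # equals 1 / a when V = a * V0
```

For larger $p$ one would use a numerical linear algebra routine for the generalized eigenvalue problem instead of the closed form.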

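Consider now simultaneous M-estimators of location and dispersion
(Section 2.7.2). Here we have two parameters, $\mu$ and $\sigma$, which satisfy a system of
two equations. Put $\boldsymbol\theta = (\mu, \sigma)$, and

$$\Psi_1(x, \boldsymbol\theta) = \psi\!\left(\frac{x - \mu}{\sigma}\right) \quad \text{and} \quad \Psi_2(x, \boldsymbol\theta) = \rho_{\mathrm{scale}}\!\left(\frac{x - \mu}{\sigma}\right) - \delta.$$

Then the estimators satisfy

$$\sum_{i=1}^{n} \boldsymbol\Psi(x_i, \hat{\boldsymbol\theta}) = \mathbf{0}, \tag{3.47}$$

with $\boldsymbol\Psi = (\Psi_1, \Psi_2)$. Given a parametric model $F_{\boldsymbol\theta}$, where $\boldsymbol\theta$ is a multidimensional
parameter of dimension $p$, a general M-estimator is defined by (3.47), where
$\boldsymbol\Psi = (\Psi_1, \ldots, \Psi_p)$.

Then (3.11) can be generalized by showing that the IF of $\hat{\boldsymbol\theta}$ is

$$\mathrm{IF}_{\hat{\boldsymbol\theta}}(x_0, F) = -\mathbf{B}^{-1}\boldsymbol\Psi(x_0, \hat{\boldsymbol\theta}_\infty), \tag{3.48}$$

where the matrix $\mathbf{B}$ has elements

$$B_{jk} = E\left\{\left.\frac{\partial\Psi_j(x, \boldsymbol\theta)}{\partial\theta_k}\right|_{\boldsymbol\theta = \hat{\boldsymbol\theta}_\infty}\right\}.$$

M-estimators of multidimensional parameters are further considered in
Section 10.5. It can be shown that they are asymptotically normal with asymptotic
covariance matrix

$$\mathbf{V} = \mathbf{B}^{-1}\left(E\,\boldsymbol\Psi(x, \boldsymbol\theta)\boldsymbol\Psi(x, \boldsymbol\theta)'\right)\mathbf{B}^{-1\prime}. \tag{3.49}$$

This therefore verifies the analogue of (3.18):

$$\mathbf{V} = E\{\mathrm{IF}(x, F)\,\mathrm{IF}(x, F)'\}. \tag{3.50}$$

The results in this section hold also when the observations $x$ are multidimensional.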


3.7. *Estimators as functionals 


The mean value may be considered as a “function” that attributes to each distribu- 
tion F its expectation (when it exists); and the sample mean may be considered as a 
function attributing to each sample $\{x_1,\dots,x_n\}$ its average $\bar{x}$. The same can be said of the median. This correspondence between distribution and sample values can be made systematic in the following way. Define the empirical distribution function of a sample $\mathbf{x} = \{x_1,\dots,x_n\}$ as

$$\hat F_{n,\mathbf{x}}(t) = \frac{1}{n}\sum_{i=1}^n I(x_i \le t)$$

(the argument $\mathbf{x}$ will be dropped when there is no ambiguity). Then, for any continuous function $g$,

$$E_{\hat F_n}\, g(x) = \frac{1}{n}\sum_{i=1}^n g(x_i).$$

Define a “function” $T$ whose argument is a distribution (a functional) as

$$T(F) = E_F\, x = \int x\, dF(x).$$

It follows that $T(\hat F_n) = \bar{x}$. If $\mathbf{x}$ is an i.i.d. sample from $F$, the law of large numbers implies that $T(\hat F_n) \to_p T(F)$ when $n \to \infty$.

Likewise, define the functional $T(F)$ as the 0.5 quantile of $F$; if it is not unique, define $T(F)$ as the midpoint of 0.5 quantiles (see Section 2.10.4). Then $T(F) = \mathrm{Med}(x)$ for $x \sim F$, and $T(\hat F_n) = \mathrm{Med}(x_1,\dots,x_n)$. If $\mathbf{x}$ is a sample from $F$ and $T(F)$ is unique, then $T(\hat F_n) \to_p T(F)$.
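The plug-in idea can be made concrete with a short sketch: evaluating a functional at the empirical distribution amounts to replacing expectations by sample averages. The data and function names below are hypothetical and serve only as an illustration.

```python
import statistics

def expectation_under_empirical(g, sample):
    """E_{F_n} g(x) = (1/n) * sum of g(x_i)."""
    return sum(g(x) for x in sample) / len(sample)

def mean_functional(sample):
    """T(F_n) for T(F) = E_F x."""
    return expectation_under_empirical(lambda x: x, sample)

def median_functional(sample):
    """T(F_n) for the median functional; statistics.median takes the
    midpoint of the two central values when n is even."""
    return statistics.median(sample)

sample = [2.1, 1.9, 2.4, 2.0, 50.0]   # hypothetical data with one outlier
t_mean = mean_functional(sample)
t_median = median_functional(sample)
```

The single outlier moves the mean functional far from the bulk of the data while the median functional stays at a central value, which is the behavior the rest of the section formalizes.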

More generally, M-estimators can be cast in this framework. For a given $\Psi$, define the functional $T(F)$ as the solution $\theta$ (assumed unique) of

$$E_F\,\Psi(x, \theta) = 0. \tag{3.51}$$

Then $T(\hat F_n)$ is a solution of

$$E_{\hat F_n}\Psi(x, \theta) = \frac{1}{n}\sum_{i=1}^n \Psi(x_i, \theta) = 0. \tag{3.52}$$

We see that $T(\hat F_n)$ and $T(F)$ correspond to the M-estimator $\hat\theta_n$ and to its asymptotic value $\hat\theta_\infty(F)$, respectively.

A similar representation can be found for L-estimators. In particular, the $\alpha$-trimmed mean corresponds to the functional

$$T(F) = \frac{1}{1-2\alpha}\, E_F\, x\, I(\alpha \le F(x) \le 1-\alpha).$$

Almost all of the estimators considered in this book can be represented as functionals; that is,

$$\hat\theta_n = T(\hat F_n) \tag{3.53}$$

for some functional $T$. The intuitive idea of robustness is that “modifying a small proportion of observations causes only a small change in the estimator”. Thus robustness is related to some form of continuity. Hampel (1971) gave this intuitive concept a rigorous mathematical expression. The following is an informal exposition of these ideas; mathematical details and further references can be found in Huber and Ronchetti (2009, Ch. 3).

The concept of continuity requires the definition of a measure of distance d(F’, G) 
between distributions. Some distances (the Lévy, bounded Lipschitz, and Prokhorov 
metrics) are adequate to express the intuitive idea of robustness, in the sense that if 
the sample y is obtained from the sample x by 


• arbitrarily modifying a small proportion of observations, and/or
• slightly modifying all observations,

then $d(\hat F_{n,\mathbf{x}}, \hat F_{n,\mathbf{y}})$ is “small”. Hampel (1971) defined the concept of qualitative robustness. A simplified version of his definition is that an estimator corresponding to a functional $T$ is said to be qualitatively robust at $F$ if $T$ is continuous at $F$ according to the metric $d$; that is, for all $\varepsilon > 0$ there exists $\delta > 0$ such that $d(F, G) < \delta$ implies $|T(F) - T(G)| < \varepsilon$.

It follows that robust estimators are consistent, in the sense that $T(\hat F_n)$ converges in probability to $T(F)$. To see this, recall that if $\mathbf{x}$ is an i.i.d. sample from $F$, then the law of large numbers implies that $\hat F_n(t) \to_p F(t)$ for all $t$. A much stronger result called the Glivenko-Cantelli theorem (Durrett 1996) states that $\hat F_n \to F$ uniformly with probability 1; that is,

$$P\left(\sup_t |\hat F_n(t) - F(t)| \to 0\right) = 1.$$

It can be shown that this implies $d(\hat F_n, F) \to_p 0$ for the Lévy metric. Moreover, if $T$ is continuous then

$$\hat\theta_\infty = T(F) = T\left(\mathrm{plim}_{n\to\infty}\hat F_n\right) = \mathrm{plim}_{n\to\infty}\, T(\hat F_n) = \mathrm{plim}_{n\to\infty}\,\hat\theta_n,$$

where “plim” stands for “limit in probability”.
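The uniform convergence stated by the Glivenko-Cantelli theorem can be observed numerically. The sketch below is an illustration under assumed settings (uniform data, a fixed seed, two sample sizes); it computes the supremum distance between the empirical and the true distribution function.

```python
import random

def sup_distance_uniform(sample):
    """sup_t |F_n(t) - F(t)| for F = Uniform(0, 1).

    The supremum is attained at an order statistic, either just before
    or exactly at the jump of the empirical distribution function.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)   # fixed seed so that the run is reproducible
distances = {
    n: sup_distance_uniform([rng.random() for _ in range(n)])
    for n in (100, 10_000)
}
```

The distance is typically of order $n^{-1/2}$, so it should be markedly smaller for the larger sample.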
A general definition of BP can be given in this framework. For a given metric, define an $\varepsilon$-neighborhood of $F$ as

$$U(\varepsilon, F) = \{G : d(F, G) \le \varepsilon\},$$

and the maximum bias of $T$ at $F$ as

$$b_\varepsilon = \sup\{|T(G) - T(F)| : G \in U(\varepsilon, F)\}.$$

For all the metrics considered, we have $d(F, G) \le 1$ for all $F, G$; hence $U(1, F)$ is the set of all distributions, and $b_1 = \sup\{|T(G) - T(F)| : \text{all } G\}$. Then the BP of $T$ at $F$ is defined as

$$\varepsilon^* = \sup\{\varepsilon : b_\varepsilon < b_1\}.$$


In this context, the IF may be viewed as a derivative. It will help to review some 
concepts from calculus. Let $h(z)$ be a function of $m$ variables, with $z = (z_1,\dots,z_m) \in R^m$. Then $h$ is differentiable at $z_0$ if there exists a vector $d = (d_1,\dots,d_m)$ such that for all $z$

$$h(z) - h(z_0) = \sum_{j=1}^m d_j(z_j - z_{0j}) + o(\|z - z_0\|), \tag{3.54}$$

where “$o$” is a function such that $\lim_{t\to 0} o(t)/t = 0$. This means that in a neighborhood of $z_0$, $h$ can be approximated by a linear function. In fact, if $z$ is near $z_0$ we have

$$h(z) \approx h(z_0) + L(z - z_0),$$

where the linear function $L$ is defined as $L(z) = d'z$. The vector $d$ is called the derivative of $h$ at $z_0$, which will be denoted by $d = D(h, z_0)$.

The directional derivative of $h$ at $z_0$ in the direction $a$ is defined as

$$D(h, z_0, a) = \lim_{t\to 0}\frac{h(z_0 + ta) - h(z_0)}{t}.$$

If $h$ is differentiable, directional derivatives exist for all directions, and it can be shown that

$$D(h, z_0, a) = a'D(h, z_0).$$

The converse is not true: there are functions for which $D(h, z_0, a)$ exists for all $a$, but $D(h, z_0)$ does not exist.
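A standard calculus counterexample (not taken from the book) makes the last remark concrete: $h(z_1, z_2) = z_1^3/(z_1^2 + z_2^2)$ with $h(0) = 0$ has directional derivatives at the origin in every direction, but they are not linear in the direction, so no derivative vector $d$ can exist. The sketch below evaluates the difference quotients numerically.

```python
def h(z1, z2):
    """h(z) = z1^3 / (z1^2 + z2^2), with h(0, 0) = 0."""
    if z1 == 0.0 and z2 == 0.0:
        return 0.0
    return z1 ** 3 / (z1 ** 2 + z2 ** 2)

def directional_derivative_at_origin(a, t=1e-6):
    """Difference quotient (h(t * a) - h(0)) / t for a small step t."""
    return (h(t * a[0], t * a[1]) - h(0.0, 0.0)) / t

d_e1 = directional_derivative_at_origin((1.0, 0.0))   # direction e1
d_e2 = directional_derivative_at_origin((0.0, 1.0))   # direction e2
d_sum = directional_derivative_at_origin((1.0, 1.0))  # direction e1 + e2

# If h were differentiable at 0, linearity would force d_sum == d_e1 + d_e2.
linearity_gap = d_sum - (d_e1 + d_e2)
```

Because $h$ is homogeneous of degree one, the difference quotient does not depend on $t$, so the nonzero gap is not a numerical artifact.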

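For an estimator $\hat\theta$ represented as in (3.53), the IF may also be viewed as a directional derivative of $T$ as follows. Since

$$(1-\varepsilon)F + \varepsilon\delta_{x_0} = F + \varepsilon(\delta_{x_0} - F),$$

we have

$$\mathrm{IF}_{\hat\theta}(x_0, F) = \lim_{\varepsilon\to 0}\frac{1}{\varepsilon}\left\{T[F + \varepsilon(\delta_{x_0} - F)] - T(F)\right\},$$

which is the derivative of $T$ in the direction $\delta_{x_0} - F$.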
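The limit above can be approximated by a difference quotient with a small $\varepsilon$. The sketch below does this for the median functional at $F = N(0,1)$ with a contamination point $x_0 > 0$, and compares the result with the known closed form $\mathrm{sgn}(x_0)/(2f(0))$. The choice of $F$, $x_0$ and $\varepsilon$ is an assumption made for the illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def median_of_contaminated_normal(eps, x0):
    """Median of (1 - eps) * N(0, 1) + eps * delta_{x0}, for x0 > 0 and small eps.

    For small eps the median m lies below x0, so the point mass contributes
    nothing to the left of m and m solves (1 - eps) * Phi(m) = 0.5.
    """
    m = STD_NORMAL.inv_cdf(0.5 / (1.0 - eps))
    assert m < x0, "eps is too large for this shortcut"
    return m

def influence_function_median(x0, eps=1e-6):
    """Difference quotient {T[(1 - eps) F + eps delta_{x0}] - T(F)} / eps, with T(F) = 0."""
    return median_of_contaminated_normal(eps, x0) / eps

if_numeric = influence_function_median(x0=3.0)
if_exact = 1.0 / (2.0 * STD_NORMAL.pdf(0.0))   # sgn(x0) / (2 f(0)) for x0 > 0
```

The numerical value does not depend on how large $x_0$ is, which reflects the bounded influence function of the median.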
In some cases, the IF may be viewed as a derivative in the stronger sense of (3.54). This means that $T(H) - T(F)$ can be approximated by a linear function of $H$ for all $H$ in a neighborhood of $F$, and not just along each single direction. For a given $\hat\theta$ represented by (3.53) and a given $F$, put for brevity

$$\xi(x) = \mathrm{IF}_{\hat\theta}(x, F).$$

Then $T$ is Fréchet-differentiable if for any distribution $H$

$$T(H) - T(F) = E_H\,\xi(x) + o(d(F, H)). \tag{3.55}$$

The class of Fréchet-differentiable estimators contains M-estimators with a bounded score function. Observe that the function

$$H \longrightarrow E_H\,\xi(x) = \int \xi(x)\,dH(x)$$

is linear in $H$. Putting $H = F$ in (3.55) yields

$$E_F\,\xi(x) = 0. \tag{3.56}$$


Some technical definitions are necessary at this point. A sequence z, of random 
variables is said to be bounded in probability (abbreviated as $z_n = O_p(1)$) if for each $\varepsilon > 0$ there exists $K$ such that $P(|z_n| > K) < \varepsilon$ for all $n$; in particular, if $z_n \to_d z$ then $z_n = O_p(1)$. We say that $z_n = O_p(u_n)$ if $z_n/u_n = O_p(1)$, and that $z_n = o_p(u_n)$ if $z_n/u_n \to_p 0$.

It is known that the distribution of $\sup_t\{\sqrt{n}\,|\hat F_n(t) - F(t)|\}$ (the so-called Kolmogorov-Smirnov statistic) tends to a distribution (see Feller, 1971), so that $\sup_t|\hat F_n(t) - F(t)| = O_p(n^{-1/2})$. For the Lévy metric mentioned above, this fact implies that also $d(\hat F_n, F) = O_p(n^{-1/2})$. Then, taking $H = \hat F_n$ in (3.55) yields

$$\hat\theta_n - \hat\theta_\infty(F) = T(\hat F_n) - T(F) = E_{\hat F_n}\xi(x) + o\left(d(\hat F_n, F)\right) = \frac{1}{n}\sum_{i=1}^n \xi(x_i) + o_p(n^{-1/2}). \tag{3.57}$$

Estimators satisfying (3.57) (called a linear expansion of $\hat\theta_n$) are asymptotically normal and verify (3.18). In fact, the i.i.d. variables $\xi(x_i)$ have mean 0 (by (3.56)) and variance

$$v = E_F\,\xi(x)^2.$$

Hence

$$\sqrt{n}\,(\hat\theta_n - \hat\theta_\infty) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \xi(x_i) + o_p(1),$$

which by the central limit theorem tends to $N(0, v)$.
For further work in this area, see Fernholz (1983) and Clarke (1983). 
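The identity $v = E_F\,\xi(x)^2$ can be evaluated in closed form in simple cases. The sketch below assumes a location M-estimator with the Huber $\psi_k$ at $F = N(0,1)$, for which $\xi(x) = \psi_k(x)/E_F\psi_k'(x)$; the tuning constant $k = 1.345$ is the value commonly quoted for 95% efficiency.

```python
from statistics import NormalDist

def huber_asymptotic_variance(k):
    """v = E xi(x)^2 with xi = psi_k / E psi_k' at F = N(0, 1)."""
    nd = NormalDist()
    central_mass = 2.0 * nd.cdf(k) - 1.0      # E psi_k'(x) = P(|x| <= k)
    tail_mass = 2.0 * (1.0 - nd.cdf(k))       # P(|x| > k)
    # E psi_k(x)^2 = E x^2 I(|x| <= k) + k^2 P(|x| > k)
    e_psi_sq = central_mass - 2.0 * k * nd.pdf(k) + k * k * tail_mass
    return e_psi_sq / central_mass ** 2

v = huber_asymptotic_variance(1.345)
efficiency = 1.0 / v    # the MLE (the sample mean) has asymptotic variance 1
```

Letting $k$ grow recovers the sample mean ($v \to 1$), while small $k$ approaches the median ($v \to \pi/2$).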
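3.8 Appendix: Proofs of results

3.8.1 IF of general M-estimators

Assume for simplicity that $\dot\Psi = \partial\Psi/\partial\theta$ exists. For a given $x_0$, put for brevity

$$F_\varepsilon = (1-\varepsilon)F + \varepsilon\delta_{x_0} \quad\text{and}\quad \theta_\varepsilon = \hat\theta_\infty(F_\varepsilon).$$

Recall that by definition

$$E_F\,\Psi(x, \theta_0) = 0. \tag{3.58}$$

Then $\theta_\varepsilon$ verifies

$$0 = E_{F_\varepsilon}\Psi(x, \theta_\varepsilon) = (1-\varepsilon)E_F\,\Psi(x, \theta_\varepsilon) + \varepsilon\Psi(x_0, \theta_\varepsilon).$$

Differentiating with respect to $\varepsilon$ yields

$$-E_F\,\Psi(x, \theta_\varepsilon) + (1-\varepsilon)\frac{\partial\theta_\varepsilon}{\partial\varepsilon}E_F\,\dot\Psi(x, \theta_\varepsilon) + \Psi(x_0, \theta_\varepsilon) + \varepsilon\frac{\partial\theta_\varepsilon}{\partial\varepsilon}\dot\Psi(x_0, \theta_\varepsilon) = 0. \tag{3.59}$$

The first term vanishes at $\varepsilon = 0$ by (3.58). Taking $\varepsilon \downarrow 0$ above yields the desired result. Note that this derivation is heuristic, since it is taken for granted that $\partial\theta_\varepsilon/\partial\varepsilon$ exists and that $\theta_\varepsilon \to \theta_0$. A rigorous proof may be found in Huber and Ronchetti (2009). The same approach serves to prove (3.48) (Problem 3.9).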


3.8.2 Maximum BP of location estimators 


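It suffices to show that $\varepsilon < \varepsilon^*$ implies $1 - \varepsilon \ge \varepsilon^*$. Let $\varepsilon < \varepsilon^*$. For $t \in R$ define $F_t(x) = F(x - t)$, and let

$$H_t = (1-\varepsilon)F + \varepsilon F_t \in \mathcal{F}_\varepsilon, \qquad H_t^* = \varepsilon F + (1-\varepsilon)F_{-t} \in \mathcal{F}_{1-\varepsilon},$$

with

$$\mathcal{F}_\varepsilon = \{(1-\varepsilon)F + \varepsilon G : G \in \mathcal{G}\},$$

where $\mathcal{G}$ is the set of all distributions. Note that

$$H_t(x) = H_t^*(x - t). \tag{3.60}$$

The equivariance of $\hat\mu$ and (3.60) imply

$$\hat\mu_\infty(H_t) = \hat\mu_\infty(H_t^*) + t \quad \forall\, t.$$

Since $\varepsilon < \varepsilon^*$, $\hat\mu_\infty(H_t)$ remains bounded when $t \to \infty$, and hence $\hat\mu_\infty(H_t^*)$ is unbounded; since $H_t^* \in \mathcal{F}_{1-\varepsilon}$, this implies $1 - \varepsilon \ge \varepsilon^*$.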
A similar approach proves (3.26). The details are left to the reader. 
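3.8.3 BP of location M-estimators

3.8.3.1 Proof of (3.21)

Put for a given $G$

$$F_\varepsilon = (1-\varepsilon)F + \varepsilon G \quad\text{and}\quad \mu_\varepsilon = \hat\mu_\infty(F_\varepsilon).$$

Then

$$(1-\varepsilon)E_F\,\psi(x - \mu_\varepsilon) + \varepsilon E_G\,\psi(x - \mu_\varepsilon) = 0. \tag{3.61}$$

We shall prove first that $\varepsilon^*$ is not larger than the right-hand side of (3.21). Let $\varepsilon < \varepsilon^*$. Then for some $C$, $|\mu_\varepsilon| \le C$ for all $G$. Take $G = \delta_{x_0}$, so that

$$(1-\varepsilon)E_F\,\psi(x - \mu_\varepsilon) + \varepsilon\psi(x_0 - \mu_\varepsilon) = 0. \tag{3.62}$$

Let $x_0 \to \infty$. Since $\mu_\varepsilon$ is bounded, we have $\psi(x_0 - \mu_\varepsilon) \to k_2$. Since $\psi \ge -k_1$, (3.62) yields

$$0 \ge -k_1(1-\varepsilon) + \varepsilon k_2, \tag{3.63}$$

which implies $\varepsilon \le k_1/(k_1 + k_2)$. Letting $x_0 \to -\infty$ yields likewise $\varepsilon \le k_2/(k_1 + k_2)$.

We shall now prove the opposite inequality. Let $\varepsilon > \varepsilon^*$. Then there exists a sequence $G_n$ such that

$$\mu_{\varepsilon,n} = \hat\mu_\infty\left((1-\varepsilon)F + \varepsilon G_n\right)$$

is unbounded. Suppose it contains a subsequence tending to $+\infty$. Then for this subsequence, $x - \mu_{\varepsilon,n} \to -\infty$ for each $x$, and since $\psi \le k_2$, (3.61) implies

$$0 \le (1-\varepsilon)\lim_{n\to\infty}E_F\,\psi(x - \mu_{\varepsilon,n}) + \varepsilon k_2,$$

and since the bounded convergence theorem (Section 10.3) implies

$$\lim_{n\to\infty}E_F\,\psi(x - \mu_{\varepsilon,n}) = E_F\left(\lim_{n\to\infty}\psi(x - \mu_{\varepsilon,n})\right),$$

we have

$$0 \le -k_1(1-\varepsilon) + \varepsilon k_2.$$

This is the opposite inequality to (3.63), from which it follows that $\varepsilon \ge \varepsilon_1^*$ in (3.22). If instead the subsequence tends to $-\infty$, we have $\varepsilon \ge \varepsilon_2^*$. This concludes the proof.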


3.8.3.2 Location with previously estimated dispersion 


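Consider first the case of monotone $\psi$. Since $\varepsilon < \varepsilon^*(\hat\sigma)$ is equivalent to $\hat\sigma$ being bounded away from zero and infinity when the contamination rate is less than $\varepsilon$, the proof is similar to that of Section 3.8.3.1.

Now consider the case of a bounded $\rho$. Assume $\rho \le \rho_0$. We shall show that $\varepsilon^* = 0.5$. Let $\varepsilon < 0.5$ and let $\mathbf{y}_N = (y_{N1},\dots,y_{Nn})$ be a sequence of data sets having $m$ elements in common with $\mathbf{x}$, with $m \ge n(1-\varepsilon)$. Call $\hat\mu_{0N}$ the initial location estimator, $\hat\sigma_N$ the previous scale and $\hat\mu_N$ the final location estimator for $\mathbf{y}_N$. Then it follows from the definitions of $\hat\mu_{0N}$, $\hat\sigma_N$ and $\hat\mu_N$ that

$$\frac{1}{n}\sum_{i=1}^n\rho\left(\frac{y_{Ni} - \hat\mu_N}{\hat\sigma_N}\right) \le \frac{1}{n}\sum_{i=1}^n\rho\left(\frac{y_{Ni} - \hat\mu_{0N}}{\hat\sigma_N}\right) \le \frac{1}{n}\sum_{i=1}^n\rho_0\left(\frac{y_{Ni} - \hat\mu_{0N}}{\hat\sigma_N}\right) = 0.5. \tag{3.64}$$

Since $\hat\mu_0$ and $\hat\sigma_0$ have BP $= 0.5 > \varepsilon$, $\hat\mu_{0N}$ (and hence $\hat\sigma_N$) remains bounded for any choice of $\mathbf{y}_N$.

Assume now that there is a sequence $\mathbf{y}_N$ such that $\hat\mu_N \to \infty$. Let $D_N = \{i : y_{Ni} = x_i\}$. Therefore

$$\lim_{N\to\infty}\frac{1}{n}\sum_{i=1}^n\rho\left(\frac{y_{Ni} - \hat\mu_N}{\hat\sigma_N}\right) \ge \lim_{N\to\infty}\frac{1}{n}\sum_{i\in D_N}\rho\left(\frac{x_i - \hat\mu_N}{\hat\sigma_N}\right) \ge 1 - \varepsilon > 0.5,$$

which contradicts (3.64), and therefore $\hat\mu_N$ must be bounded, which implies $\varepsilon^* \ge 0.5$.

We now deal with the case of bounded $\rho$ when $\rho \le \rho_0$ does not hold. Consider first the case of fixed $\sigma$. Huber (1984) calculated the finite BP for this situation. For the sake of simplicity we treat the asymptotic case with point-mass contamination. Let

$$\gamma = E_F\,\rho\left(\frac{x - \mu_0}{\sigma}\right),$$

where $F$ is the underlying distribution and $\mu_0 = \hat\mu_\infty(F)$:

$$\mu_0 = \arg\min_\mu E_F\,\rho\left(\frac{x - \mu}{\sigma}\right).$$

It will be shown that

$$\varepsilon^* = \frac{1-\gamma}{2-\gamma}. \tag{3.65}$$

Consider a sequence $x_N$ tending to infinity, and let $F_N = (1-\varepsilon)F + \varepsilon\delta_{x_N}$. For $\mu \in R$

$$A_N(\mu) = E_{F_N}\rho\left(\frac{x - \mu}{\sigma}\right) = (1-\varepsilon)E_F\,\rho\left(\frac{x - \mu}{\sigma}\right) + \varepsilon\rho\left(\frac{x_N - \mu}{\sigma}\right).$$

Let $\varepsilon < \mathrm{BP}(\hat\mu)$ first. Then $\mu_N = \hat\mu_\infty(F_N)$ remains bounded when $x_N \to \infty$. By the definition of $\mu_0$,

$$A_N(\mu_N) \ge (1-\varepsilon)\gamma + \varepsilon\rho\left(\frac{x_N - \mu_N}{\sigma}\right).$$

Since $\hat\mu_\infty(F_N)$ minimizes $A_N$, we have $A_N(\mu_N) \le A_N(x_N)$, and the latter tends to $1 - \varepsilon$. The boundedness of $\mu_N$ implies that $x_N - \mu_N \to \infty$, and hence we have in the limit

$$(1-\varepsilon)\gamma + \varepsilon \le 1 - \varepsilon,$$

which is equivalent to $\varepsilon \le \varepsilon^*$. The reverse inequality follows likewise. When $\rho$ is the bisquare with efficiency 0.95, $F = N(0, 1)$ and $\sigma = 1$, we have $\varepsilon^* = 0.47$. Note that $\varepsilon^*$ is an increasing function of $\sigma$.

In the more realistic case that $\sigma$ is previously estimated, the situation is more complicated; but intuitively it can be seen that the situation is actually more favorable, since the contamination implies a larger $\hat\sigma$. The procedure used above can be used to derive numerical bounds for $\varepsilon^*$. For the same $\rho$ and MAD dispersion, it can be shown that $\varepsilon^* > 0.49$ for the normal distribution.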
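The value 0.47 quoted for the fixed-scale case can be checked numerically from $\gamma = E_F\,\rho((x - \mu_0)/\sigma)$ and $\varepsilon^* = (1-\gamma)/(2-\gamma)$. The sketch below assumes the bisquare $\rho$ scaled so that $\rho(\infty) = 1$, the commonly used tuning constant 4.685 for 95% efficiency, and a Simpson-rule integration; these choices are assumptions of the illustration rather than details given in the text.

```python
import math

def bisquare_rho(t):
    """Bisquare rho-function scaled so that rho(infinity) = 1."""
    return 1.0 - (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 1.0

def gamma_normal(k, sigma=1.0, half_width=10.0, steps=20_000):
    """gamma = E rho(x / (k * sigma)) for x ~ N(0, 1), composite Simpson rule."""
    def integrand(x):
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return bisquare_rho(x / (k * sigma)) * density
    h = 2.0 * half_width / steps
    total = integrand(-half_width) + integrand(half_width)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(-half_width + i * h)
    return total * h / 3.0

def breakdown_point(gamma):
    """eps* = (1 - gamma) / (2 - gamma)."""
    return (1.0 - gamma) / (2.0 - gamma)

gamma = gamma_normal(k=4.685)
eps_star = breakdown_point(gamma)
eps_star_larger_sigma = breakdown_point(gamma_normal(k=4.685, sigma=2.0))
```

Increasing $\sigma$ lowers $\gamma$ and therefore raises $\varepsilon^*$, in line with the remark that $\varepsilon^*$ is an increasing function of $\sigma$.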


3.8.4 Maximum bias of location M-estimators 


Let $F_\mu(x) = F_0(x - \mu)$ where $F_0$ is symmetric about zero. Let $\psi$ be a nondecreasing and bounded $\psi$-function and call $k = \psi(\infty)$. The asymptotic value of the estimator is $\hat\mu_\infty(F_\mu) = \mu$, and the bias for an arbitrary distribution $H$ is $\hat\mu_\infty(H) - \mu$. Define for brevity the function

$$g(b) = E_{F_0}\psi(x + b),$$

which is odd. It will be assumed that $g$ is increasing. This holds either if $\psi$ is increasing, or if $F_0$ has positive density everywhere.

Let $\varepsilon < 0.5$. Then it will be shown that the maximum bias is the solution $b_\varepsilon$ of the equation

$$g(b) = \frac{k\varepsilon}{1-\varepsilon}. \tag{3.66}$$

Since the estimator is shift equivariant, it may be assumed without loss of generality that $\mu = 0$. Write, for brevity, $\mu_H = \hat\mu_\infty(H)$. For a distribution $H = (1-\varepsilon)F_0 + \varepsilon G$ (with $G$ arbitrary), $\mu_H$ is the solution of

$$(1-\varepsilon)g(-\mu_H) + \varepsilon E_G\,\psi(x - \mu_H) = 0. \tag{3.67}$$

Since $|\psi(x)| \le k$ for all $x$, we have for any $G$

$$(1-\varepsilon)g(-\mu_H) - \varepsilon k \le 0 \le (1-\varepsilon)g(-\mu_H) + \varepsilon k,$$

which implies

$$-\frac{k\varepsilon}{1-\varepsilon} \le g(-\mu_H) \le \frac{k\varepsilon}{1-\varepsilon},$$

and hence $|\mu_H| \le b_\varepsilon$. By letting $G = \delta_{x_0}$ in (3.67) with $x_0 \to \pm\infty$, we see that the bound is attained. This completes the proof.

For the median, $\psi(x) = \mathrm{sgn}(x)$ and $k = 1$, and a simple calculation shows (recalling the symmetry of $F_0$) that $g(b) = 2F_0(b) - 1$, and therefore

$$b_\varepsilon = F_0^{-1}\left(\frac{1}{2(1-\varepsilon)}\right). \tag{3.68}$$

To calculate the contamination sensitivity $\gamma_c$, put $\dot b_\varepsilon = db_\varepsilon/d\varepsilon$, so that $\dot b_0 = \gamma_c$. Then differentiating (3.66) yields

$$g'(b_\varepsilon)\,\dot b_\varepsilon = \frac{k}{(1-\varepsilon)^2},$$

and hence (recalling $b_0 = 0$) $\gamma_c = k/g'(0)$. Since $g'(0) = E_{F_0}\psi'(x)$, we see that this coincides with (3.31) and hence $\gamma_c = \gamma^*$.
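Formula (3.68) is easy to evaluate. The following sketch assumes $F_0 = N(0,1)$ and a few illustrative contamination rates; it also checks that the computed $b_\varepsilon$ solves $g(b) = \varepsilon/(1-\varepsilon)$ with $g(b) = 2F_0(b) - 1$.

```python
from statistics import NormalDist

def max_bias_median(eps, f0=NormalDist()):
    """b_eps = F0^{-1}(1 / (2 (1 - eps))) from (3.68), for 0 <= eps < 0.5."""
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must lie in [0, 0.5)")
    return f0.inv_cdf(1.0 / (2.0 * (1.0 - eps)))

def g_median(b, f0=NormalDist()):
    """g(b) = 2 F0(b) - 1 for the median (psi = sgn, k = 1)."""
    return 2.0 * f0.cdf(b) - 1.0

rates = (0.05, 0.10, 0.25)
biases = {eps: max_bias_median(eps) for eps in rates}
residuals = {eps: g_median(biases[eps]) - eps / (1.0 - eps) for eps in rates}
```

The maximum bias grows slowly for small $\varepsilon$ and diverges as $\varepsilon \to 0.5$, which is where the median breaks down.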
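3.8.5 The minimax bias property of the median

Let $F_0$ have a density $f_0(x)$ which is a nonincreasing function of $|x|$ (a symmetric unimodal distribution). Let $b_\varepsilon$ be the maximum asymptotic bias of the median given in (3.68). Let $\hat\theta$ be any location equivariant estimator. It will be shown that the maximum bias of $\hat\theta$ in a neighborhood $\mathcal{F}(F_0, \varepsilon)$, defined in (3.3), is not smaller than $b_\varepsilon$.

Call $F_+$ the distribution with density

$$f_+(x) = \begin{cases}(1-\varepsilon)f_0(x) & \text{if } x \le b_\varepsilon\\ (1-\varepsilon)f_0(x - 2b_\varepsilon) & \text{otherwise.}\end{cases}$$

Then $f_+$ belongs to $\mathcal{F}(F_0, \varepsilon)$. In fact, it can be written as

$$f_+ = (1-\varepsilon)f_0 + \varepsilon g,$$

with

$$g(x) = \frac{1-\varepsilon}{\varepsilon}\left(f_0(x - 2b_\varepsilon) - f_0(x)\right)I(x > b_\varepsilon).$$

We must show that $g$ is a density. It is nonnegative, since $x \in (b_\varepsilon, 2b_\varepsilon)$ implies $|x - 2b_\varepsilon| \le |x|$, and the unimodality of $f_0$ yields $f_0(x - 2b_\varepsilon) \ge f_0(x)$; the same thing happens if $x > 2b_\varepsilon$. Moreover, its integral equals one, since by (3.68),

$$\int_{b_\varepsilon}^{\infty}\left(f_0(x - 2b_\varepsilon) - f_0(x)\right)dx = 2F_0(b_\varepsilon) - 1 = \frac{\varepsilon}{1-\varepsilon}.$$

Define

$$F_-(x) = F_+(x + 2b_\varepsilon),$$

which also belongs to $\mathcal{F}(F_0, \varepsilon)$ by the same argument. The equivariance of $\hat\theta$ implies that

$$\hat\theta_\infty(F_+) - \hat\theta_\infty(F_-) = 2b_\varepsilon,$$

and hence $|\hat\theta_\infty(F_+)|$ and $|\hat\theta_\infty(F_-)|$ cannot both be less than $b_\varepsilon$.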


3.8.6 Minimizing the GES 


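To avoid cumbersome technical details, we assume henceforth that $\Psi_0(x, \theta)$ has a continuous distribution for all $\theta$. We prove first that the M-estimator $\tilde\Psi$ is Fisher-consistent. In fact, by the definition of the function $M$,

$$E_\theta\tilde\Psi(x, \theta) = -P_\theta(\Psi_0(x, \theta) \le M(\theta)) + P_\theta(\Psi_0(x, \theta) > M(\theta)) = -\frac{1}{2} + \frac{1}{2} = 0.$$

Since $\max_x|\tilde\Psi(x, \theta)| = 1$, the estimator has GES

$$\gamma^*(\tilde\Psi, \theta) = \frac{1}{|B(\theta, \tilde\Psi)|}.$$

It will be shown first that for any Fisher-consistent $\Psi$,

$$B(\theta, \Psi) = E_\theta\Psi(x, \theta)\Psi_0(x, \theta). \tag{3.69}$$

We give the proof for the continuous case; the discrete case is similar. Condition (3.32) may be written as

$$E_\theta\Psi(x, \theta) = \int_{-\infty}^{\infty}\Psi(x, \theta)f_\theta(x)\,dx = 0.$$

Differentiating the above expression with respect to $\theta$ yields

$$B(\theta, \Psi) + \int_{-\infty}^{\infty}\Psi(x, \theta)\dot f_\theta(x)\,dx = 0,$$

and (3.33)-(3.34) yield

$$B(\theta, \Psi) = -\int_{-\infty}^{\infty}\Psi(x, \theta)\dot f_\theta(x)\,dx = \int_{-\infty}^{\infty}\Psi(x, \theta)\Psi_0(x, \theta)f_\theta(x)\,dx = E_\theta\Psi(x, \theta)\Psi_0(x, \theta),$$

as stated. Note that $\partial\tilde\Psi/\partial\theta$ does not exist, and hence we must define $B(\theta, \tilde\Psi)$ through (3.12) and not (3.13).

Now let $C = \{x : \Psi_0(x, \theta) > M(\theta)\}$, with complement $C'$. It follows from $I(C') = 1 - I(C)$ that

$$\tilde\Psi(x, \theta) = I(C) - I(C') = 2I(C) - 1.$$

Using (3.35) and (3.36) we have

$$B(\theta, \tilde\Psi) = E_\theta\tilde\Psi(x, \theta)\Psi_0(x, \theta) = 2E_\theta\Psi_0(x, \theta)I(C) - E_\theta\Psi_0(x, \theta) = 2E_\theta\Psi_0(x, \theta)I(C).$$

Hence

$$\gamma^*(\tilde\Psi, \theta) = \frac{1}{2|E_\theta\Psi_0(x, \theta)I(C)|}. \tag{3.70}$$

Consider a Fisher-consistent $\Psi$. Then

$$\gamma^*(\Psi, \theta) = \frac{\max_x|\Psi(x, \theta)|}{|B(\theta, \Psi)|}. \tag{3.71}$$

Using (3.32) and (3.69) we have

$$B(\theta, \Psi) = E_\theta\Psi(x, \theta)\Psi_0(x, \theta) = E_\theta\Psi(x, \theta)(\Psi_0(x, \theta) - M(\theta))$$
$$= E_\theta\Psi(x, \theta)(\Psi_0(x, \theta) - M(\theta))I(C) + E_\theta\Psi(x, \theta)(\Psi_0(x, \theta) - M(\theta))I(C'). \tag{3.72}$$

Besides

$$|E_\theta\Psi(x, \theta)(\Psi_0(x, \theta) - M(\theta))I(C)| \le \max_x|\Psi(x, \theta)|\,E_\theta(\Psi_0(x, \theta) - M(\theta))I(C) = \max_x|\Psi(x, \theta)|\left(E_\theta\Psi_0(x, \theta)I(C) - \frac{M(\theta)}{2}\right). \tag{3.73}$$

Similarly

$$|E_\theta\Psi(x, \theta)(\Psi_0(x, \theta) - M(\theta))I(C')| \le -\max_x|\Psi(x, \theta)|\,E_\theta(\Psi_0(x, \theta) - M(\theta))I(C') = \max_x|\Psi(x, \theta)|\left(E_\theta\Psi_0(x, \theta)I(C) + \frac{M(\theta)}{2}\right). \tag{3.74}$$

Therefore, by (3.72)-(3.74) we get

$$|B(\theta, \Psi)| \le 2\max_x|\Psi(x, \theta)|\,E_\theta\Psi_0(x, \theta)I(C).$$

Therefore, using (3.71), we have

$$\gamma^*(\Psi, \theta) \ge \frac{1}{2|E_\theta\Psi_0(x, \theta)I(C)|}. \tag{3.75}$$

And finally (3.70) and (3.75) yield

$$\gamma^*(\tilde\Psi, \theta) \le \gamma^*(\Psi, \theta).$$

The case of a discrete distribution is similar, but the details are much more involved.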


3.8.7 Hampel optimality 


It will be shown first that estimators with score function (3.41) are optimal for Hampel 
problems with certain bounds. 


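Theorem 3.2 Given $k(\theta)$, the function $\Psi^*$ given by (3.41) and satisfying (3.32) is optimal for the direct Hampel problem with bound

$$G(\theta) = \gamma^*(\Psi^*, \theta) = \frac{k(\theta)}{|B(\theta, \Psi^*)|}$$

and for the dual Hampel problem with bound

$$V(\theta) = v(\Psi^*, \theta).$$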
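Proof of Theorem 3.2: We shall show that $\Psi^*$ solves Hampel's direct problem. Observe that $\Psi^*$ satisfies the side condition in (3.39), since by definition $\gamma^*(\Psi^*, \theta) = G(\theta)$. Let now $\Psi$ satisfy (3.32) and

$$\gamma^*(\Psi, \theta) \le G(\theta). \tag{3.76}$$

We must show that

$$v(\Psi, \theta) \ge v(\Psi^*, \theta). \tag{3.77}$$

We prove (3.77) for a fixed $\theta$. Since for any real number $\lambda \ne 0$, $\lambda\Psi$ defines the same estimator as $\Psi$, we can assume without loss of generality that

$$B(\theta, \Psi) = B(\theta, \Psi^*), \tag{3.78}$$

and hence

$$\gamma^*(\Psi, \theta) = \frac{\max_x(|\Psi(x, \theta)|)}{|B(\theta, \Psi^*)|}.$$

Then, condition (3.76) becomes

$$\max_x(|\Psi(x, \theta)|) \le k(\theta) \tag{3.79}$$

and (3.77) becomes $A(\theta, \Psi) \ge A(\theta, \Psi^*)$, so that we have to prove

$$E_\theta\Psi^2(x, \theta) \ge E_\theta\Psi^{*2}(x, \theta) \tag{3.80}$$

for any $\Psi$ satisfying (3.79).

Call $\Psi_0^c$ the ML score function centered by $r$:

$$\Psi_0^c(x, \theta) = \Psi_0(x, \theta) - r(\theta).$$

It follows from (3.69) and (3.32) that

$$E_\theta\Psi(x, \theta)\Psi_0^c(x, \theta) = B(\theta, \Psi).$$

We now calculate $E_\theta\Psi^2(x, \theta)$. Recalling (3.78) we have

$$E_\theta\Psi(x, \theta)^2 = E_\theta\{[\Psi(x, \theta) - \Psi_0^c(x, \theta)] + \Psi_0^c(x, \theta)\}^2 \tag{3.81}$$
$$= E_\theta(\Psi(x, \theta) - \Psi_0^c(x, \theta))^2 + E_\theta\Psi_0^c(x, \theta)^2 + 2E_\theta\Psi(x, \theta)\Psi_0^c(x, \theta) - 2E_\theta\Psi_0^c(x, \theta)^2$$
$$= E_\theta(\Psi(x, \theta) - \Psi_0^c(x, \theta))^2 - E_\theta\Psi_0^c(x, \theta)^2 + 2B(\theta, \Psi^*).$$

Since $E_\theta\Psi_0^c(x, \theta)^2$ and $B(\theta, \Psi^*)$ do not depend on $\Psi$, it suffices to prove that putting $\Psi = \Psi^*$ minimizes

$$E_\theta(\Psi(x, \theta) - \Psi_0^c(x, \theta))^2$$

subject to (3.79). Observe that for any function $\Psi(x, \theta)$ satisfying (3.79) we have

$$|\Psi(x, \theta) - \Psi_0^c(x, \theta)| \ge \left||\Psi_0^c(x, \theta)| - k(\theta)\right| I\{|\Psi_0^c(x, \theta)| > k(\theta)\},$$

and since

$$|\Psi^*(x, \theta) - \Psi_0^c(x, \theta)| = \left||\Psi_0^c(x, \theta)| - k(\theta)\right| I\{|\Psi_0^c(x, \theta)| > k(\theta)\},$$

we get

$$|\Psi(x, \theta) - \Psi_0^c(x, \theta)| \ge |\Psi^*(x, \theta) - \Psi_0^c(x, \theta)|.$$

Then

$$E_\theta(\Psi(x, \theta) - \Psi_0^c(x, \theta))^2 \ge E_\theta(\Psi^*(x, \theta) - \Psi_0^c(x, \theta))^2,$$

which proves the statement for the direct problem. The dual problem is treated likewise.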

The last theorem proves optimality for a certain class of bounds. The next one 
shows that in fact any feasible bounds can be considered. 


Theorem 3.3 Let

$$G(\theta) \ge \gamma^*(\tilde\Psi, \theta), \qquad V(\theta) \ge v(\Psi_0, \theta) \quad\text{for all } \theta, \tag{3.82}$$

where $\Psi_0$ and $\tilde\Psi$ are defined in (3.34) and (3.36) respectively. Then the solutions to both the direct and the dual Hampel problems have the form (3.41) for a suitable function $k(\theta)$.

Proof of Theorem 3.3: We treat the dual problem; the direct problem is dealt with in the same way.

We show first that given any $k$, there exists $r$ so that $\Psi^*(x, \theta)$ is Fisher-consistent. Let

$$\lambda(r) = E_\theta\psi_k(\Psi_0(x, \theta) - r).$$

Then $\lambda$ is continuous, and $\lim_{r\to\pm\infty}\lambda(r) = \mp k$. Hence by the Intermediate Value Theorem, there exists some $r$ such that $\lambda(r) = 0$. In addition, it can be shown that $B(\theta, \Psi^*) \ne 0$. The proof is involved and can be found in Hampel et al. (1986).

In view of (3.69):

$$v(\Psi^*_{(k(\theta))}, \theta) = \frac{E_\theta\Psi^*_{(k(\theta))}(x, \theta)^2}{\left[E_\theta\Psi^*_{(k(\theta))}(x, \theta)\Psi_0(x, \theta)\right]^2}, \tag{3.83}$$

where $\Psi^*$ in (3.41) is written as $\Psi^*_{(k(\theta))}$ to stress its dependence on $k$. Recall that the limit cases $k \to 0$ and $k \to \infty$ yield $v(\tilde\Psi, \theta)$ (which may be infinite) and $v(\Psi_0, \theta)$ respectively. Let $V(\theta)$ be given and such that $V(\theta) \ge v(\Psi_0, \theta)$. Consider a fixed $\theta$. If $V(\theta) \le v(\tilde\Psi, \theta)$, then there exists a value $k(\theta)$ such that $v(\Psi^*_{(k(\theta))}, \theta) = V(\theta)$. If $V(\theta) > v(\tilde\Psi, \theta)$, then putting $k(\theta) = 0$ (i.e., $\Psi^* = \tilde\Psi$) minimizes $\gamma^*(\Psi^*_{(k(\theta))}, \theta)$ and satisfies

$$v(\Psi^*_{(k(\theta))}, \theta) \le V(\theta).$$
3.9 Problems

3.1. Verify (3.15).

3.2. Prove (3.19).

3.3. Verify that the breakdown points of the SD, the MAD and the IQR are 0, 1/2 and 1/4, respectively.

3.4. Show that the asymptotic BP of the $\alpha$-trimmed mean is $\alpha$.

3.5. Show that the BP of equivariant dispersion estimators is $\le 0.5$.

3.6. Show that the asymptotic BP of sample $\beta$-quantiles is $\min(\beta, 1 - \beta)$ (recall Problem 2.13).

3.7. Prove (3.26).

3.8. Verify (3.46).

3.9. Prove (3.48).

3.10. Prove (3.46).

3.11. Consider the location M-estimator with Huber function $\psi_k$ and the MADN as previously computed dispersion. Recall that it has BP = 1/2 for all $k$. Show however that for each given $\varepsilon < 0.5$, its maximum bias MB($\varepsilon$) at a given distribution is an unbounded function of $k$.

3.12. Let the density $f(x)$ be a decreasing function of $|x|$. Show that the shortest interval covering a given probability is symmetric about 0. Use this result to calculate the IF of the Shorth estimator (Problem 2.16a) for data with distribution $f$.

3.13. Show that the BP of the estimator $Q_n$ in (2.64) is 0.5. Calculate the BP for the estimator defined as the median of the differences; that is, with $k = m/2$ in (2.64).

3.14. Show the equivalence of the direct and dual Hampel problems (3.39) and (3.40).

3.15. For the exponential family $f_\theta(x) = I(x \ge 0)\exp(-x/\theta)/\theta$:

(a) Show that the estimator with smallest GES is $\mathrm{Med}(x)/\ln 2$.

(b) Find the asymptotic distribution of this estimator and its efficiency with respect to the MLE.

(c) Find the form of the Hampel-optimal estimator for this family.

(d) Write a program to compute the Hampel-optimal estimator with efficiency 0.95.

3.16. Consider the estimator $\hat\mu_1$ defined by the one-step Newton-Raphson procedure defined in Section 2.10.5.1. Assume that the underlying distribution is symmetric about $\mu$, that $\psi$ is odd and differentiable, and that the initial estimator $\hat\mu_0$ is consistent for $\mu$.

(a) Show that $\hat\mu_1$ is consistent for $\mu$.

(b) If $\psi$ is twice differentiable, show that $\hat\mu_1$ has the same influence function as the M-estimator $\hat\mu$ defined by $\mathrm{ave}\{\psi(x - \mu)\} = 0$ (and hence, by (3.18), $\hat\mu_1$ has the same asymptotic variance as $\hat\mu$).

(c) If $\psi$ is bounded and $\psi'(x) > 0$ for all $x$, and the asymptotic BP of $\hat\mu_0$ is 0.5, show that also $\hat\mu_1$ has an asymptotic BP of 0.5.

4 


Linear Regression 1 


4.1 Introduction 


In this chapter we begin the discussion of the estimation of the parameters of linear 
regression models, which will be pursued in the next chapter. M-estimators for 
regression are developed in the same way as for location. In this chapter we deal 
with fixed (nonrandom) predictors. Recall that our estimators of choice for location 
were redescending M-estimators using the median as starting point and the MAD 
as dispersion. Redescending estimators will also be our choice for regression. When 
the predictors are fixed and fulfill certain conditions that are satisfied in particular 
for analysis of variance models, monotone M-estimators — which are easy to 
compute — are robust, and can be used as starting points to compute a redescending 
estimator. When the predictors are random, or when they are fixed but in some sense 
“unbalanced”, monotone estimators cease to be reliable, and the starting points for 
redescending estimators must be computed otherwise. This problem is considered 
in the next chapter. 

We start with an example that shows the weakness of the least-squares estimator. 


Example 4.1 In an experiment on the speed of learning of rats (Bond, 1979), times
were recorded for a rat to go through a shuttlebox in successive attempts. If the 
time exceeded 5s, the rat received an electric shock for the duration of the next 
attempt. The data are the number of shocks received and the average time for all 
attempts between shocks. Tables and figures for this example can be obtained with 
script Shock.R. 


Robust Statistics: Theory and Methods (with R), Second Edition. 

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matías Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd. 
Companion website: www.wiley.com/go/maronna/robust 
[Figure: scatterplot of the shock data; axes: Number of Shocks, Average Time]

Figure 4.1 Shock data: LS fit with all data and omitting points 1, 2 and 4


Figure 4.1 shows the data and the straight line fitted by least squares (LS) to the 
linear regression model 


$$y_i = \beta_0 + \beta_1 x_i + u_i.$$


The relationship between the variables is seen to be roughly linear, except for the three 
upper-left points. The LS line does not fit the bulk of the data, being a compromise 
between those three points and the rest. The figure also shows the LS fit computed 
without using the three points. It gives a better representation of the majority of the 
data, while indicating the exceptional character of points 1, 2 and 4. 

We aim to develop procedures that give a good fit to the bulk of the data without 
being perturbed by a small proportion of outliers, and that do not require deciding in 
advance which observations are outliers. 

Table 4.1 gives the estimated parameters for an LS fit, with the complete data and with the three atypical points deleted, and also for two robust estimators ($L_1$ and bisquare) that will be defined later.


Table 4.1 Regression estimates for rats data

                     Intercept    Slope
LS                     10.48      −0.61
LS (−1, 2, 4)           7.22      −0.32
L1                      8.22      −0.42
Bisquare M-est.         7.83      −0.41
The LS fit of a straight line consists of finding $\hat\beta_0, \hat\beta_1$ such that the residuals

$$r_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$$

satisfy

$$\sum_{i=1}^n r_i^2 = \min. \tag{4.1}$$

Recall that in the location case obtained by setting $\beta_1 = 0$, the solution of (2.16) is the sample mean; that is, the LS estimator of location is the average of the data values. Since the median satisfies (2.18), the regression analogue of the median, often called an $L_1$ estimator (also called the least absolute deviation or LAD estimator), is defined by

$$\sum_{i=1}^n |r_i| = \min. \tag{4.2}$$
i=1 


For our data, the solution of (4.2) is given in Table 4.1, and it can be seen that its slope 
is smaller than that of the LS estimator; that is, it is less affected by the outliers. 
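The contrast between (4.1) and (4.2) can be reproduced on a small artificial data set. The sketch below does not use the shock data; it uses hypothetical points lying exactly on a line, with two outlying responses. The $L_1$ fit is found by brute force, relying on the fact that for a straight line some minimizer of (4.2) passes through two of the data points.

```python
from itertools import combinations

def ls_line(x, y):
    """Closed-form least-squares intercept and slope for a straight line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def l1_line(x, y):
    """L1 line by exhaustive search over lines through pairs of data points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yi - intercept - slope * xi) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Hypothetical data: y = 8 - 0.4 x exactly, except for two outliers at small x.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [8 - 0.4 * xi for xi in x]
y[0], y[1] = 14.0, 13.0

ls_intercept, ls_slope = ls_line(x, y)
l1_intercept, l1_slope = l1_line(x, y)
```

On these data the $L_1$ line recovers the line followed by the eight regular points, while the LS line is pulled toward the two outliers, qualitatively as in Table 4.1. The exhaustive search is only practical for small $n$.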

Now consider the more general case of a dataset of $n$ observations $(x_{i1},\dots,x_{ip}, y_i)$ where $x_{i1},\dots,x_{ip}$ are predictor variables (the predictors or independent variables) and $y_i$ is a response variable (the response or dependent variable). The data are assumed to follow the linear model

$$y_i = \sum_{j=1}^p x_{ij}\beta_j + u_i, \quad i = 1,\dots,n, \tag{4.3}$$

where $\beta_1,\dots,\beta_p$ are unknown parameters to be estimated, and the $u_i$ are random variables (the “errors”). In a designed experiment, the $x_{ij}$ are nonrandom (or fixed); that is, determined before the experiment. When the data are observational the $x_{ij}$ are random variables. We sometimes have mixed situations with both fixed and random predictors.

Denoting by $\mathbf{x}_i$ and $\boldsymbol\beta$ the $p$-dimensional column vectors with coordinates $(x_{i1},\dots,x_{ip})$ and $(\beta_1,\dots,\beta_p)$ respectively, the model can be more compactly written as

$$y_i = \mathbf{x}_i'\boldsymbol\beta + u_i, \tag{4.4}$$

where $\mathbf{x}'$ is the transpose of $\mathbf{x}$. In the common case where the model has a constant term, the first coordinate of each $\mathbf{x}_i$ is 1 and the model may be written as

$$y_i = \beta_0 + \underline{\mathbf{x}}_i'\boldsymbol\beta_1 + u_i, \tag{4.5}$$

where $\underline{\mathbf{x}}_i = (x_{i1},\dots,x_{i(p-1)})'$ and $\boldsymbol\beta_1$ are in $R^{p-1}$ and

$$\mathbf{x}_i = \begin{pmatrix}1\\ \underline{\mathbf{x}}_i\end{pmatrix}, \qquad \boldsymbol\beta = \begin{pmatrix}\beta_0\\ \boldsymbol\beta_1\end{pmatrix}. \tag{4.6}$$
Here $\beta_0$ is called the intercept and the elements of $\boldsymbol\beta_1$ are the slopes. Call $\mathbf{X}$ the $n \times p$ matrix with elements $x_{ij}$, and let $\mathbf{y}$ and $\mathbf{u}$ be the vectors with elements $y_i$ and $u_i$ respectively ($i = 1,\dots,n$). Then the linear model (4.4) may be written

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}. \tag{4.7}$$

The fitted values $\hat y_i$ and the residuals $r_i$ corresponding to a vector $\boldsymbol\beta$ are defined respectively as

$$\hat y_i(\boldsymbol\beta) = \mathbf{x}_i'\boldsymbol\beta \quad\text{and}\quad r_i(\boldsymbol\beta) = y_i - \hat y_i(\boldsymbol\beta).$$

The dependence of the fitted values and residuals on $\boldsymbol\beta$ will be dropped when this does not cause confusion. In order to combine robustness and efficiency, along the lines of Chapter 2, we shall discuss regression M-estimators $\hat{\boldsymbol\beta}$, defined as solutions of equations of the form

$$\sum_{i=1}^n \rho\left(\frac{r_i(\hat{\boldsymbol\beta})}{\hat\sigma}\right) = \min. \tag{4.8}$$

Here $\rho$ is a $\rho$-function (Definition 2.1), and $\hat\sigma$ is an auxiliary scale estimator that is required to make $\hat{\boldsymbol\beta}$ scale equivariant (see (2.5) and (4.16)). The LS estimator and the $L_1$ estimator correspond respectively to $\rho(t) = t^2$ and $\rho(t) = |t|$. In these two cases $\hat\sigma$ becomes a constant factor outside the summation sign and minimizing (4.8) is equivalent to minimizing $\sum_{i=1}^n r_i^2$ or $\sum_{i=1}^n |r_i|$, respectively. Thus neither the LS nor the $L_1$ estimators require a scale estimator.
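The role of the scale in (4.8) can be seen in the simplest special case of a model with a constant term only (location). The sketch below minimizes the objective over a grid for three choices of $\rho$ and two values of the scale. The data, the grid and the Huber constant 1.345 are assumptions made for the illustration, and the Huber function is used here only as a convenient convex example of a loss that is neither $t^2$ nor $|t|$.

```python
def huber_rho(t, k=1.345):
    """Huber loss: quadratic for |t| <= k, linear beyond."""
    return 0.5 * t * t if abs(t) <= k else k * abs(t) - 0.5 * k * k

def grid_minimizer(rho, y, sigma, lo=0.0, hi=10.0, steps=10_000):
    """Grid value of mu minimizing sum of rho((y_i - mu) / sigma)."""
    grid = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    return min(grid, key=lambda mu: sum(rho((yi - mu) / sigma) for yi in y))

y = [0.0, 0.0, 0.0, 0.0, 10.0]   # hypothetical sample with one outlier

losses = (("LS", lambda t: t * t), ("L1", abs), ("Huber", huber_rho))
results = {
    name: [grid_minimizer(rho, y, sigma) for sigma in (1.0, 100.0)]
    for name, rho in losses
}
```

The LS and $L_1$ minimizers are the same for both scales (the mean and the median), whereas the Huber minimizer moves from a value near the bulk of the data to the mean when the scale is taken far too large. This is why a sensible auxiliary scale estimator matters for general $\rho$.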

In a designed experiment, the predictors $x_{ij}$ are fixed. An important special case of fixed predictors is when they represent categorical predictors with values of either 0 or 1. The simplest situation is the comparison of several treatments, usually called a one-way analysis of variance (or “one-way ANOVA”). Here we have $p$ samples $y_{ik}$ ($i = 1,\dots,n_k$, $k = 1,\dots,p$) and the model

$$y_{ik} = \beta_k + u_{ik} \tag{4.9}$$

where the $u_{ik}$ are i.i.d. Call $\mathbf{1}_m$ the column vector of $m$ ones. Then the matrix $\mathbf{X}$ of predictors is

$$\mathbf{X} = \begin{bmatrix}\mathbf{1}_{n_1} & & & \\ & \mathbf{1}_{n_2} & & \\ & & \ddots & \\ & & & \mathbf{1}_{n_p}\end{bmatrix}$$

with the blank positions filled with zeros. The next level of model complexity is a factorial design with two factors represented by two categorical variables. This is usually called a two-way analysis of variance. In this case we have data $y_{ijk}$, $i = 1,\dots,I$, $j = 1,\dots,J$, $k = 1,\dots,K_{ij}$, following an additive model usually written in the form

$$y_{ijk} = \mu + \alpha_i + \gamma_j + u_{ijk} \tag{4.10}$$

with “cells” $i, j$, and $K_{ij}$ observations per cell. Here $p = 1 + I + J$ and $\boldsymbol\beta$ has coordinates $(\mu, \alpha_1,\dots,\alpha_I, \gamma_1,\dots,\gamma_J)$. The rank of $\mathbf{X}$ is $p^* = I + J - 1 < p$ and constraints on the parameters need to be added to make the estimators unique, typically

$$\sum_{i=1}^I \alpha_i = \sum_{j=1}^J \gamma_j = 0. \tag{4.11}$$

4.2 Review of the least squares method 


The LS method was proposed in 1805 by Legendre (for a fascinating account, see 
Stigler (1986)). The main reason for its immediate and lasting success was that it 
was the only method of estimation that could be effectively computed before the 
advent of electronic computers. We shall review the main properties of LS for mul- 
tiple regression. (See any standard text on regression analysis for further details, for 
example Weisberg (1985), Draper and Smith (2001), Montgomery et al. (2001) or
Stapleton (1995).) The LS estimator of $\beta$ is the $\hat\beta$ such that

    $\sum_{i=1}^n r_i^2(\hat\beta) = \min$.    (4.12)

Differentiating with respect to $\beta$ yields

    $\sum_{i=1}^n r_i(\hat\beta)\, x_i = 0$,    (4.13)

which is equivalent to the linear equations

    $X'X\hat\beta = X'y$.

The above equations are usually called the “normal equations”. If the model contains
a constant term, it follows from (4.13) that the residuals have zero average.
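
As a small illustration of the normal equations, the following sketch solves them in closed form for a straight line with intercept and checks that the residuals have zero average. The function name `ls_line` and the data are hypothetical, chosen only for this example; the straight-line model is an assumption that keeps the algebra to a 2 x 2 system.

```python
def ls_line(x, y):
    """Solve the 2x2 normal equations X'X b = X'y for the line y = b0 + b1*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx  # determinant of X'X; nonzero iff X has full rank
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


# Hypothetical data, used only to exercise the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ls_line(x, y)
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# Both components of (4.13) vanish up to rounding error.
sum_r = sum(residuals)                                 # intercept column of X
sum_rx = sum(ri * xi for ri, xi in zip(residuals, x))  # slope column of X
```

The first component of (4.13) is the sum of the residuals, which is why a constant term forces their average to zero.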

The matrix of predictors $X$ is said to have full rank if its columns are linearly
independent. This is equivalent to

    $Xa \neq 0 \quad \forall\, a \neq 0$

and also equivalent to the nonsingularity of $X'X$. If $X$ has full rank then the solution
of (4.13) is unique and is given by

    $\hat\beta_{LS} = \hat\beta_{LS}(X, y) = (X'X)^{-1}X'y$.    (4.14)

If the model contains a constant term, then the first column of $X$ is identically one,
and the full-rank condition implies that no other column is constant. If $X$ is not of
full rank, for example as in (4.10), then we have what is called collinearity. When
there is collinearity, the parameters are not identifiable in the sense that there exist
$\beta_1 \neq \beta_2$ such that $X\beta_1 = X\beta_2$, which implies that (4.13) has infinitely many solutions, all
yielding the same fitted values and hence the same residuals.
The LS estimator satisfies

    $\hat\beta_{LS}(X, y + X\gamma) = \hat\beta_{LS}(X, y) + \gamma$  for all $\gamma \in R^p$    (4.15)
    $\hat\beta_{LS}(X, \lambda y) = \lambda\hat\beta_{LS}(X, y)$  for all $\lambda \in R$    (4.16)

and for all nonsingular $p \times p$ matrices $A$

    $\hat\beta_{LS}(XA, y) = A^{-1}\hat\beta_{LS}(X, y)$.    (4.17)

The properties (4.15), (4.16) and (4.17) are called respectively regression, scale and
affine equivariance. These are desirable properties, since they allow us to know how
the estimator changes under these transformations of the data. A more precise
justification is given in Section 4.9.1.
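
The three equivariance properties can be checked numerically. The sketch below does so for a straight line with intercept; `ls_line`, the data and the constants are hypothetical. For affine equivariance it assumes the particular matrix $A$ with rows $(1, c)$ and $(0, d)$, which maps the predictor $x$ to $c + dx$ and keeps the intercept column, so that $A^{-1}\hat\beta = (\hat\beta_0 - c\hat\beta_1/d,\ \hat\beta_1/d)$.

```python
def ls_line(x, y):
    """LS fit of y = b0 + b1*x via the 2x2 normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det


# Hypothetical data.
x = [1.0, 2.0, 4.0, 7.0, 8.0]
y = [1.5, 2.0, 4.5, 6.0, 9.0]
b0, b1 = ls_line(x, y)

# (4.15) regression equivariance: adding X*gamma to y shifts the estimate by gamma.
g0, g1 = 3.0, -2.0
reg = ls_line(x, [yi + g0 + g1 * xi for xi, yi in zip(x, y)])

# (4.16) scale equivariance: multiplying y by lambda multiplies the estimate.
lam = -4.0
sca = ls_line(x, [lam * yi for yi in y])

# (4.17) affine equivariance for A = [[1, c], [0, d]]: the rows (1, x_i) of X
# become (1, c + d*x_i), and the estimate becomes A^{-1} times the original.
c, d = 5.0, 2.0
aff = ls_line([c + d * xi for xi in x], y)
expected_aff = (b0 - c * b1 / d, b1 / d)
```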

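Assume now that the $u_i$ are i.i.d. with

    $Eu_i = 0$  and  $\mathrm{Var}(u_i) = \sigma^2$

and that $X$ is fixed; that is, nonrandom and of full rank. Under the linear model (4.4)
with $X$ of full rank, $\hat\beta_{LS}$ is unbiased and its mean and covariance matrix are given by

    $E\hat\beta_{LS} = \beta$,  $\mathrm{Var}(\hat\beta_{LS}) = \sigma^2 (X'X)^{-1}$    (4.18)

where henceforth $\mathrm{Var}(y)$ will denote the covariance matrix of the random vector $y$.
Under model (4.5) we have the decomposition

    $(X'X)^{-1} = \begin{bmatrix} \frac{1}{n} + \bar{x}'C^{-1}\bar{x} & -\bar{x}'C^{-1} \\ -C^{-1}\bar{x} & C^{-1} \end{bmatrix}$

where

    $\bar{x} = \mathrm{ave}_i(\underline{x}_i)$,  $C = \sum_{i=1}^n (\underline{x}_i - \bar{x})(\underline{x}_i - \bar{x})'$    (4.19)

and hence

    $\mathrm{Var}(\hat\beta_{1,LS}) = \sigma^2 C^{-1}$.    (4.20)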
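
For a single predictor plus an intercept, $C$ is a scalar and the decomposition can be verified directly. The sketch below compares the explicit inverse of the 2 x 2 matrix $X'X$ with the block form built from $\bar{x}$ and $C$; the data are hypothetical and the restriction to one predictor is an assumption made to avoid matrix routines.

```python
# Hypothetical values of a single predictor.
x = [1.0, 2.0, 4.0, 7.0]
n = len(x)
xbar = sum(x) / n
C = sum((xi - xbar) ** 2 for xi in x)  # scalar version of C in (4.19)

# Direct inverse of X'X = [[n, sum x], [sum x, sum x^2]].
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
direct = [[sxx / det, -sx / det],
          [-sx / det, n / det]]

# Block form from the decomposition: the lower-right entry is C^{-1}, which
# is why the slope estimator has variance sigma^2 / C as in (4.20).
block = [[1 / n + xbar * xbar / C, -xbar / C],
         [-xbar / C, 1 / C]]
```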


If $Eu_i \neq 0$, then $\hat\beta_{LS}$ will be biased. However, if the model contains an intercept,
the bias will only affect the intercept and not the slopes. More precisely, under (4.5)

    $E\hat\beta_{1,LS} = \beta_1$,    (4.21)

although $E\hat\beta_{0,LS} \neq \beta_0$ (see Section 4.9.2 for details).

Let $p^*$ be the rank of $X$ and recall that if $p^* < p$ (that is, if $X$ is collinear) then
$\hat\beta_{LS}$ is not uniquely defined but all solutions of (4.13) yield the same residuals. Then
an unbiased estimator of $\sigma^2$ is defined by

    $s^2 = \frac{1}{n - p^*} \sum_{i=1}^n r_i^2$,    (4.22)

whether or not $X$ is of full rank.

If the $u_i$ are normal and $X$ is of full rank, then $\hat\beta_{LS}$ is multivariate normal

    $\hat\beta_{LS} \sim N_p(\beta, \sigma^2 (X'X)^{-1})$,    (4.23)

where $N_p(\mu, \Sigma)$ denotes the $p$-variate normal distribution with mean vector $\mu$ and
covariance matrix $\Sigma$.

Let $\gamma$ now be a linear combination of the parameters: $\gamma = \beta'a$ with $a$ a constant
vector. Then the natural estimator of $\gamma$ is $\hat\gamma = \hat\beta_{LS}'a$, which according to (4.23) is
$N(\gamma, \sigma_\gamma^2)$, with

    $\sigma_\gamma^2 = \sigma^2 a'(X'X)^{-1}a$.

An unbiased estimator of $\sigma_\gamma^2$ is

    $\hat\sigma_\gamma^2 = s^2 a'(X'X)^{-1}a$.    (4.24)

Confidence intervals and tests for $\gamma$ may be obtained from the fact that under
normality the “t-statistic”

    $T = \frac{\hat\gamma - \gamma}{\hat\sigma_\gamma}$    (4.25)

has a t-distribution with $n - p^*$ degrees of freedom, where $p^* = \mathrm{rank}(X)$. In particular,
a confidence upper bound and a two-sided confidence interval for $\gamma$ with level $1 - \alpha$
are given by

    $\hat\gamma + \hat\sigma_\gamma t_{n-p^*,1-\alpha}$  and  $(\hat\gamma - \hat\sigma_\gamma t_{n-p^*,1-\alpha/2},\ \hat\gamma + \hat\sigma_\gamma t_{n-p^*,1-\alpha/2})$    (4.26)

where $t_{n,\delta}$ is the $\delta$-quantile of a t-distribution with $n$ degrees of freedom. Similarly, the
tests of level $\alpha$ for the null hypothesis $H_0 : \gamma = \gamma_0$ against the two-sided alternative
$\gamma \neq \gamma_0$ and the one-sided alternative $\gamma > \gamma_0$ have the rejection regions

    $|\hat\gamma - \gamma_0| > \hat\sigma_\gamma t_{n-p^*,1-\alpha/2}$  and  $\hat\gamma > \gamma_0 + \hat\sigma_\gamma t_{n-p^*,1-\alpha}$,    (4.27)

respectively.

If the $u_i$ are not normal but have a finite variance, then for large $n$ it can be shown
using the central limit theorem that $\hat\beta_{LS}$ is approximately normal, with parameters
given by (4.18), provided that:

    none of the $x_i$ is “much larger” than the rest.    (4.28)

This condition is formalized in (10.33) in Section 10.10.1. Recall that for large $n$
the quantiles $t_{n,\beta}$ of the t-distribution converge to the quantiles $z_\beta$ of $N(0, 1)$. For
the large-sample theory of the LS estimator, see Stapleton (1995) and Huber and
Ronchetti (2009, p. 157).


4.3 Classical methods for outlier detection 


The most popular way to deal with regression outliers is to use LS and try to find the 
influential observations. After they have been identified, a decision must be taken, 
for example modifying or deleting them and applying LS to the modified data. Many 
numerical and/or graphical procedures — so-called regression diagnostics — are avail- 
able for detecting influential observations based on an initial LS fit. They include the 
familiar Q—Q plots of residuals, and plots of residuals against fitted values. See Weis- 
berg (1985), Belsley et al. (1980) or Chatterjee and Hadi (1988) for further details on 
these methods, as well as for proofs of the statements in this section. 

The influence of one observation $z_i = (x_i, y_i)$ on the LS estimator depends both on
$y_i$ being too large or too small compared to instances of $y$ from similar instances of $x$,
and on how “large” $x_i$ is; that is, how much leverage $x_i$ has. Most popular diagnostics
for measuring the influence of $z_i = (x_i, y_i)$ are based on comparing the LS estimator
based on the full data with LS when $z_i$ is omitted. Call $\hat\beta$ and $\hat\beta_{(i)}$ the LS estimates
based on the full data and on the data without $z_i$, and let

    $\hat{y} = X\hat\beta$,  $\hat{y}_{(i)} = X\hat\beta_{(i)}$

where $r_i = r_i(\hat\beta)$. Note that if $p^* < p$, then $\hat\beta_{(i)}$ is not unique, but $\hat{y}_{(i)}$ is unique. Then
the Cook distance of $z_i$ is

    $D_i = \frac{1}{p^* s^2} \|\hat{y}_{(i)} - \hat{y}\|^2$

where $p^* = \mathrm{rank}(X)$ and $s$ is the residual standard deviation estimator

    $s^2 = \frac{1}{n - p^*} \sum_{j=1}^n r_j^2$.

Call $H$ the matrix of the orthogonal projection on the image of $X$; that is, on the
subspace $\{X\beta : \beta \in R^p\}$. The matrix $H$ is the so-called “hat matrix” and its diagonal
elements $h_1,\ldots,h_n$ are the leverages of $x_1,\ldots,x_n$. If $p^* = p$, then $H$ fulfills

    $H = X(X'X)^{-1}X'$  and  $h_i = x_i'(X'X)^{-1}x_i$.    (4.29)

The $h_i$ satisfy

    $\sum_{i=1}^n h_i = p^*$,  $h_i \in [0, 1]$.    (4.30)

It can be shown that the Cook distance is easily computed in terms of the $h_i$:

    $D_i = \frac{r_i^2}{s^2} \frac{h_i}{p^* (1 - h_i)^2}$.    (4.31)

It follows from (4.31) that observations with high leverage are more influential than
observations with low leverage having the same residuals.
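
The relation between the definition of the Cook distance and the shortcut (4.31) can be checked on a small data set. The following sketch assumes a straight line with intercept ($p^* = 2$), computes the leverages from (4.29) and the distances from (4.31), and compares them with a brute-force refit that omits each observation in turn. All function names and the data are hypothetical.

```python
def ls_line(x, y):
    """LS fit of y = b0 + b1*x via the 2x2 normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det


def cook_distances(x, y):
    """Leverages from (4.29) and Cook distances from (4.31), with p* = 2."""
    n, p = len(x), 2
    b0, b1 = ls_line(x, y)
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(ri * ri for ri in r) / (n - p)
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    det = n * sxx - sx * sx
    # h_i = (1, x_i) (X'X)^{-1} (1, x_i)' written out for the 2x2 case
    h = [(sxx - 2 * sx * xi + n * xi * xi) / det for xi in x]
    D = [ri * ri / s2 * hi / (p * (1 - hi) ** 2) for ri, hi in zip(r, h)]
    return h, D


def cook_by_refit(x, y, i):
    """Cook distance of observation i straight from its definition."""
    n, p = len(x), 2
    b0, b1 = ls_line(x, y)
    s2 = sum((yj - b0 - b1 * xj) ** 2 for xj, yj in zip(x, y)) / (n - p)
    c0, c1 = ls_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return sum(((c0 + c1 * xj) - (b0 + b1 * xj)) ** 2 for xj in x) / (p * s2)


# Hypothetical data: the last point has high leverage and is off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 4.0]
h, D = cook_distances(x, y)
```

In this example the leverages add up to $p^* = 2$, as (4.30) requires, and the last observation has by far the largest Cook distance.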
When the regression has an intercept,

    $h_i = \frac{1}{n} + (\underline{x}_i - \bar{x})'(X^{*\prime}X^*)^{-1}(\underline{x}_i - \bar{x})$    (4.32)

where $\bar{x}$ is the average of the $\underline{x}_i$s and $X^*$ is the $n \times (p - 1)$ matrix whose $i$th row is
$(\underline{x}_i - \bar{x})'$. In this case $h_i$ is a measure of how far $\underline{x}_i$ is from the average value $\bar{x}$.

Calculating $h_i$ does not always require the explicit computation of $H$. For
example, in the case of the two-way design (4.10), it follows from the symmetry of
the design that all the $h_i$ are equal, and then (4.30) yields

    $h_i = \frac{p^*}{n} = \frac{I + J - 1}{IJ}$.

While $D_i$ can detect outliers in simple situations, it fails for more complex
configurations and may even fail to recognize a single outlier. The reason is that $r_i$, $h_i$
and $s$ may be greatly influenced by the outlier. It is safer to use statistics based on the
“leave-one-out” approach, as follows. The leave-one-out residual $r_{(i)} = y_i - \hat\beta_{(i)}'x_i$ is
known to be expressible as

    $r_{(i)} = \frac{r_i}{1 - h_i}$    (4.33)

and it is shown in the above references that

    $\mathrm{Var}(r_{(i)}) = \frac{\sigma^2}{1 - h_i}$.

An estimator of $\sigma^2$ that is free of the influence of $x_i$ is the quantity $s_{(i)}^2$, which is
defined like $s^2$, but has the $i$th observation deleted from the sample. It is also shown
in the references above that

    $s_{(i)}^2 = \frac{1}{n - p^* - 1} \left[ (n - p^*) s^2 - \frac{r_i^2}{1 - h_i} \right]$,    (4.34)



and a Studentized version of $r_{(i)}$ is given by

    $t_{(i)} = \sqrt{1 - h_i}\, \frac{r_{(i)}}{s_{(i)}} = \frac{1}{s_{(i)}} \frac{r_i}{\sqrt{1 - h_i}}$.    (4.35)

Under the normal distribution model, $t_{(i)}$ has a t-distribution with $n - p^* - 1$ degrees of
freedom, the same number of degrees of freedom as $s_{(i)}^2$ in (4.34). Then a test of
outlyingness with significance level $\alpha$ is to decide that the $i$th observation is an
outlier if $|t_{(i)}| > t_{n-p^*-1,1-\alpha/2}$. A graphical analysis is provided by the
normal Q-Q plot of $t_{(i)}$.
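
The shortcuts (4.33) and (4.34) avoid refitting the model $n$ times. The sketch below evaluates them, together with (4.35), for a straight line with intercept and provides a brute-force refit for comparison. The function names and the data are hypothetical, and the restriction to one predictor is an assumption made for brevity.

```python
import math


def ls_line(x, y):
    """LS fit of y = b0 + b1*x via the 2x2 normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det


def deletion_diagnostics(x, y):
    """Leave-one-out residuals, scales and Studentized residuals without refitting."""
    n, p = len(x), 2
    b0, b1 = ls_line(x, y)
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(ri * ri for ri in r) / (n - p)
    xbar = sum(x) / n
    c = sum((xi - xbar) ** 2 for xi in x)
    h = [1 / n + (xi - xbar) ** 2 / c for xi in x]                 # (4.32)
    r_del = [ri / (1 - hi) for ri, hi in zip(r, h)]               # (4.33)
    s2_del = [((n - p) * s2 - ri * ri / (1 - hi)) / (n - p - 1)
              for ri, hi in zip(r, h)]                            # (4.34)
    t = [ri / math.sqrt(s2i * (1 - hi))
         for ri, s2i, hi in zip(r, s2_del, h)]                    # (4.35)
    return r_del, s2_del, t


def refit_without(x, y, i):
    """Brute force: refit without observation i, return r_(i) and s_(i)^2."""
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    c0, c1 = ls_line(xs, ys)
    r_i = y[i] - c0 - c1 * x[i]
    s2_i = sum((yj - c0 - c1 * xj) ** 2 for xj, yj in zip(xs, ys)) / (len(xs) - 2)
    return r_i, s2_i


# Hypothetical data: the last point has high leverage and is off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 4.0]
r_del, s2_del, t = deletion_diagnostics(x, y)
```

With these data the last observation has the largest $|t_{(i)}|$, because omitting it removes its influence on both the fit and the scale estimate.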

While the above “complete” leave-one-out approach ensures the detection of an 
isolated outlier, it can still be fooled by the combined action of several outliers, an 
effect that is referred to as masking. 


Example 4.2 The data in Table 4.2 (Scheffé 1959, p. 138) are the yields of grain for 
eight varieties of oats in five replications of a randomized-block experiment. Tables 
and figures for this example can be obtained with script oats.R. 


Fitting (4.10) by LS yields residuals with no noticeable structure, and the usual 
F-tests for row and column effects have highly significant p-values of 0.00002 and 
0.001, respectively. To show the effect of outliers on the classical procedure, we have 
modified five data values. Table 4.3 shows the data with the five altered values in 
boldface. Figure 4.2 shows the normal Q-Q plot of $t_{(i)}$ for the altered data. Again,
nothing suspicious appears. But the p-values of the F-tests are now 0.13 and 0.04, 
the first of which is quite insignificant and the second of which is barely significant 
at the liberal 0.05 level. The diagnostics have thus failed to indicate a departure from 
the model, with serious consequences. 

There is a vast literature on regression diagnostics. These procedures are fast, and 
are much better than naively fitting LS without further care. But they are inferior to 
robust methods in several senses: 


they may fail in the presence of masking; 

the distribution of the resulting estimator is unknown; 

the variability may be underestimated; 

once an outlier is found further ones may appear, and it is not clear when one 
should stop. 


Table 4.2 Oats data

                     Block
Variety     I     II    III    IV     V

   1       296   357   340   331   348
   2       402   390   431   340   320
   3       437   334   426   320   296
   4       303   319   310   260   242
   5       469   405   442   487   394
   6       345   342   358   300   308
   7       324   339   357   352   230
   8       488   374   401   338   320

Table 4.3 Modified oats data

                     Block
Variety     I     II    III    IV     V

   1       476   357   340   331   348
   2       402   390   431   340   320
   3       437   334   426   320   296
   4       303   319   310   260   382
   5       469   405   442   287   394
   6       345   342   358   300   308
   7       324   339   357   352   410
   8       288   374   401   338   320

Figure 4.2 Altered oats data: Q-Q plot of LS residuals (horizontal axis: quantiles of standard normal)


4.4 Regression M-estimators 


As in Section 2.3 we shall now develop estimators combining robustness and
efficiency. Assume model (4.4) with fixed $X$ where $u_i$ has a density

    $\frac{1}{\sigma} f_0\left(\frac{u}{\sigma}\right)$,

where $\sigma$ is a scale parameter. For the linear model (4.4) the $y_i$ are independent but not
identically distributed, $y_i$ has density

    $\frac{1}{\sigma} f_0\left(\frac{y - x_i'\beta}{\sigma}\right)$

and the likelihood function for $\beta$ assuming a fixed value of $\sigma$ is

    $L(\beta) = \frac{1}{\sigma^n} \prod_{i=1}^n f_0\left(\frac{y_i - x_i'\beta}{\sigma}\right)$.

Calculating the MLE means maximizing $L(\beta)$, which is equivalent to finding $\hat\beta$ such
that

    $\frac{1}{n} \sum_{i=1}^n \rho_0\left(\frac{r_i(\hat\beta)}{\sigma}\right) + \ln\sigma = \min$,    (4.36)

where $\rho_0 = -\ln f_0$, as in (2.14). We shall deal with estimators defined by (4.36).
Continuing to assume $\sigma$ is known, and differentiating with respect to $\beta$, we have the
analogue of the normal equations:

    $\sum_{i=1}^n \psi_0\left(\frac{r_i(\hat\beta)}{\sigma}\right) x_i = 0$,    (4.37)

where $\psi_0 = \rho_0' = -f_0'/f_0$. If $f_0$ is the standard normal then $\hat\beta$ is the LS estimator (4.12),
and if $f_0$ is the double exponential density then $\hat\beta$ satisfies

    $\sum_{i=1}^n |r_i(\hat\beta)| = \min$

and $\hat\beta$ is called an $L_1$ estimator, which is the regression equivalent of the median. It
is remarkable that this estimator was studied before LS (by Boscovich in 1757 and
Laplace in 1799). Differentiating the likelihood function in this case gives

    $\sum_{i=1}^n \mathrm{sgn}(r_i(\hat\beta))\, x_i = 0$    (4.38)

where “sgn” denotes the sign function (2.20). If the model contains an intercept term,
(4.38) implies that the residuals have zero median.

Unlike LS there are in general no explicit expressions for an $L_1$ estimator. However,
there exist very fast algorithms to compute it (Barrodale and Roberts, 1973;
Portnoy and Koenker, 1997). We note also that an $L_1$ estimator $\hat\beta$ may not be unique,
and it has the property that at least $p$ residuals are zero (Bloomfield and Staiger, 1983).

We define regression M-estimators as solutions $\hat\beta$ to

    $\sum_{i=1}^n \rho\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) = \min$    (4.39)

where $\hat\sigma$ is an error scale estimator. Differentiating (4.39) yields the equation

    $\sum_{i=1}^n \psi\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) x_i = 0$    (4.40)

where $\psi = \rho'$. The last equation need not be the estimating equation of an MLE. In
most situations considered in this chapter, $\hat\sigma$ is computed previously, but it can also
be computed simultaneously through a scale M-estimating equation.

It will henceforth be assumed that $\rho$ and $\psi$ are respectively a $\rho$- and a $\psi$-function in
the sense of Definitions 2.1 and 2.2. The matrix $X$ will be assumed to have full rank. In
the special case where $\sigma$ is assumed known, the reader may verify that the estimators
are regression and affine equivariant (see Problem 4.1). The case of estimated $\sigma$ is
considered in Section 4.4.2.

Solutions to (4.40) with monotone (resp. redescending) $\psi$ are called monotone
(resp. redescending) regression M-estimators. The main advantage of monotone
estimators is that all solutions of (4.40) are solutions of (4.39). Furthermore, if
$\psi$ is increasing then the solution is unique (see Theorem 10.15). The example
in Section 2.8.1 showed that in the case of redescending location estimators, the
estimating equation may have “bad” roots. This cannot happen with monotone
estimators. On the other hand, we have seen in Section 3.4 that redescending
M-estimators of location yield a better trade-off between robustness and efficiency,
and the same can be shown to hold in the regression context. Computing redescending
estimators requires a starting point, and this will be the main role of monotone
estimators. This matter is pursued further in Section 4.4.2.


4.4.1 M-estimators with known scale 


Assume model (4.4) with $u$ such that

    $E\psi\left(\frac{u}{\sigma}\right) = 0$    (4.41)

which holds in particular if $u$ is symmetric. Then, if (4.28) holds, $\hat\beta$ is consistent for
$\beta$ in the sense that

    $\hat\beta \to_p \beta$    (4.42)

when $n \to \infty$, and furthermore for large $n$

    $\mathcal{D}(\hat\beta) \approx N_p(\beta, v(X'X)^{-1})$    (4.43)

where $v$ is the same as in (2.65):

    $v = \sigma^2 \frac{E\psi(u/\sigma)^2}{(E\psi'(u/\sigma))^2}$.    (4.44)

A general proof is given by Yohai and Maronna (1979).

Thus the approximate covariance matrix of an M-estimator differs only by a constant
factor from that of the LS estimator. Hence its efficiency for normal $u$ does not
depend on $X$; that is,

    $\mathrm{Eff}(\hat\beta) = \frac{\sigma_0^2}{v}$    (4.45)

where $v$ is given by (4.44) with the expectations computed for $u \sim N(0, \sigma_0^2)$. It is easy
to see that the efficiency does not depend on $\sigma_0$.
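
To make (4.44) and (4.45) concrete, the following sketch evaluates the normal efficiency in closed form for one particular monotone $\psi$. It assumes Huber's $\psi(x) = \max(-k, \min(x, k))$, which is not defined in this section, and takes $\sigma = \sigma_0 = 1$, which is harmless because the efficiency does not depend on $\sigma_0$. The function name `huber_efficiency` is hypothetical.

```python
from statistics import NormalDist


def huber_efficiency(k):
    """Normal efficiency 1/v for the (assumed) Huber psi with constant k."""
    z = NormalDist()
    inside = 2 * z.cdf(k) - 1  # P(|u| <= k), which equals E psi'(u)
    # E psi(u)^2 = integral of u^2 over [-k, k] plus k^2 * P(|u| > k)
    e_psi2 = inside - 2 * k * z.pdf(k) + k * k * (1 - inside)
    v = e_psi2 / inside ** 2   # (4.44) with sigma = 1
    return 1 / v               # (4.45) with sigma_0 = 1


eff = huber_efficiency(1.345)
```

Under these assumptions the constant $k = 1.345$ gives an efficiency of about 0.95, and the efficiency increases toward 1 as $k$ grows and $\psi$ approaches the identity, which corresponds to LS.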

It is important to note that if we have a model with intercept (4.5), and (4.41)
does not hold, then the intercept is asymptotically biased, but the slope estimators are
nonetheless consistent (see Section 4.9.2):

    $\hat\beta_1 \to_p \beta_1$.    (4.46)


4.4.2 M-estimators with preliminary scale 


When estimating location with an M-estimator in Section 2.7.1, we estimated $\sigma$ using
the MAD. Here, the equivalent procedure is first to compute the $L_1$ fit and from it
obtain the analog of the normalized MAD by taking the median of the nonnull absolute
residuals:

    $\hat\sigma = \frac{1}{0.675} \mathrm{Med}_i(|r_i| \mid r_i \neq 0)$.    (4.47)

The reason for using only nonnull residuals is that since at least $p$ residuals are null,
including all residuals when $p$ is large could lead to underestimating $\sigma$. Recall that
the $L_1$ estimator does not require estimating a scale.

Write $\hat\sigma$ in (4.47) as $\hat\sigma(X, y)$. Then, since the $L_1$ estimator is regression, scale and
affine equivariant, it is easy to show that

    $\hat\sigma(X, y + X\gamma) = \hat\sigma(X, y)$,  $\hat\sigma(XA, y) = \hat\sigma(X, y)$,  $\hat\sigma(X, \lambda y) = |\lambda|\hat\sigma(X, y)$    (4.48)

for all $\gamma \in R^p$, nonsingular $A \in R^{p \times p}$ and $\lambda \in R$. We say that $\hat\sigma$ is regression and affine
invariant and scale equivariant.

We then obtain a regression M-estimator by solving (4.39) or (4.40) with $\hat\sigma$
instead of $\sigma$. Then (4.48) implies that $\hat\beta$ is regression, affine and scale equivariant
(Problem 4.2).

Assume that $\hat\sigma \to_p \sigma$ and that (4.41) holds. Under (4.4) we would expect that for
large $n$ the distribution of $\hat\beta$ is approximated by (4.43) and (4.44); that is, that $\hat\sigma$ can
be replaced by $\sigma$. Since $\psi$ is odd, this holds in general if the distribution of $u_i$ is
symmetric. Thus the efficiency of the estimator does not depend on $X$.

If the model contains an intercept, the approximate distribution result holds for
the slopes without any requirement on $u_i$. More precisely, assume model (4.5). Then
$\hat\beta_1$ is approximately normal, with mean $\beta_1$ and covariance matrix $vC^{-1}$, with $v$ given
by (4.44) and $C$ defined in (4.19) (see Section 10.10.1 for a heuristic proof).

We can estimate $v$ in (4.44) as

    $\hat{v} = \hat\sigma^2 \frac{\mathrm{ave}_i\{\psi(r_i/\hat\sigma)^2\}}{[\mathrm{ave}_i\{\psi'(r_i/\hat\sigma)\}]^2} \frac{n}{n - p}$    (4.49)

where the denominator $n - p$ appears for the same reasons as in (4.22). Hence for
large $n$ we may treat $\hat\beta$ as approximately normal:

    $\mathcal{D}(\hat\beta) \approx N_p(\beta, \hat{v}(X'X)^{-1})$.    (4.50)

Thus we can proceed as in (4.24)-(4.27), but replacing $s^2$ in (4.24) by the estimator
$\hat{v}$ above so that $\hat\sigma_\gamma^2 = \hat{v}a'(X'X)^{-1}a$, to obtain approximate confidence intervals and
tests. In the case of intervals and tests for a single coefficient $\beta_i$ we have

    $\hat\sigma_{\beta_i}^2 = \hat{v}(X'X)^{-1}_{ii}$

where the subscripts $ii$ mean the $i$th diagonal element of matrix $(X'X)^{-1}$.

As we have seen in the location case, one important advantage of redescending
estimators is that they give null weight to large residuals, which implies the possibility
of a high efficiency for both normal and heavy-tailed data. This is also true for
regression, since the efficiency depends only on $v$, which is the same as for location.
Therefore our recommended procedure is to use $L_1$ as a basis for computing $\hat\sigma$ and
as a starting point for the iterative computing of a bisquare M-estimator.


Example 4.1 (continuation) The slope and intercept values for the bisquare 
M-estimator with 0.85 efficiency are shown in Table 4.1, along with those of the 
LS estimator using the full data, the LS estimator computed without the points 
labeled 1, 2, and 4, and the $L_1$ estimator. The corresponding fitted lines are shown in
Figure 4.3. The results are very similar to the LS estimator computed without the 
three atypical points. 

The estimated standard deviations of the slope are 0.122 for LS and 0.050 for 
the bisquare M-estimator, and the respective confidence intervals with level 0.95 
are (—0.849, —0.371) and (—0.580, —0.384). It is seen that the outliers inflate the 
confidence interval based on the LS estimator relative to that based on the bisquare 
M-estimator.

Figure 4.3 Rats data: fits by least squares (LS), $L_1$, bisquare M-estimator (M) and
least squares with outliers omitted (LS-)

Example 4.2 (continuation) Tables and figures for this example are obtained
with script oats.R. Figure 4.4 shows the residual Q-Q plot based on the bisquare
M-estimator and it is seen that the five modified values stand out from the rest.

Table 4.4 gives the p-values of the robust likelihood ratio-type test to be described 
in Section 4.7.2 for row and column effects. Values are shown for the original and 
the altered data, together with those of the classical F-test already given. 

We see that the M-estimator results for the altered data are quite close to those for 
the original data. Furthermore, for the altered data the robust test again gives strong 
evidence of row and column effects. 


4.4.3 Simultaneous estimation of regression and scale 


Another approach to deal with the estimation of $\sigma$ is to proceed as in Section 2.7.2,
namely to add to the estimating equation (4.40) for $\beta$ an M-estimating equation for
$\sigma$, resulting in the system

    $\sum_{i=1}^n \psi\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) x_i = 0$,    (4.51)

    $\frac{1}{n} \sum_{i=1}^n \rho_{scale}\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) = \delta$,    (4.52)

where $\rho_{scale}$ is a $\rho$-function. Note that differentiating (4.36) with respect to $\beta$ and $\sigma$
yields a system of the form (4.51)-(4.52), with $\rho_{scale}$ given in (2.73). Therefore this
class of estimators includes the MLE.

Simultaneous estimators with monotonic $\psi$ are less robust than those of the former
Section 4.4.2 (recall Section 3.2.4 for the location case), but they will be used
with redescending $\psi$ in another context in Section 5.4.1.

Figure 4.4 Altered oats data: normal Q-Q plot of residuals from M-estimator

Table 4.4 Oats data: p-values of tests

                    Rows                                     Columns
             F              Robust                      F        Robust
Original     1.56 x 10^-5   1.7 x 10^[exponent lost]    0.001    2.6 x 10^[exponent lost]
Altered      0.13           4 x 10^[exponent lost]      0.04     1.7 x 10^-4


4.5 Numerical computing of monotone M-estimators 


4.5.1 The L, estimator 


As was mentioned above, computing the $L_1$ estimator requires sophisticated algorithms,
such as the one due to Barrodale and Roberts (1973). There are, however,
some cases in which this estimator can be computed explicitly. For regression
through the origin ($y_i = \beta x_i + u_i$), the reader can verify that $\hat\beta$ is a “weighted median”
(Problem 4.4). For one-way ANOVA (4.9) we immediately have that $\hat\beta_k = \mathrm{Med}_i(y_{ik})$.
And for two-way ANOVA with one observation per cell (i.e., (4.10)-(4.11) with
$K_{ij} = 1$), there is a simple method that we now describe.

Let $y_{ij} = \mu + \alpha_i + \gamma_j + u_{ij}$. Then differentiating $\sum_i \sum_j |y_{ij} - \mu - \alpha_i - \gamma_j|$ with
respect to $\mu$, $\alpha_i$ and $\gamma_j$, and recalling that the derivative of $|x|$ is $\mathrm{sgn}(x)$, it follows that
(4.38) is equivalent to

    $\mathrm{Med}_{i,j}(r_{ij}) = \mathrm{Med}_j(r_{ij}) = \mathrm{Med}_i(r_{ij}) = 0$  for all $i, j$    (4.53)

where $r_{ij} = y_{ij} - \hat\mu - \hat\alpha_i - \hat\gamma_j$. These equations suggest an iterative procedure due to
Tukey (1977), known as “median polish”, which goes as follows (where “$a \leftarrow b$”
stands for “replace $a$ by $b$”):

1. Put $\hat\alpha_i = \hat\gamma_j = 0$ for $i = 1,\ldots,I$ and $j = 1,\ldots,J$, and $\hat\mu = 0$, and hence $r_{ij} = y_{ij}$.
2. For $i = 1,\ldots,I$: let $\delta_i = \mathrm{Med}_j(r_{ij})$. Update $\hat\alpha_i \leftarrow \hat\alpha_i + \delta_i$ and $r_{ij} \leftarrow r_{ij} - \delta_i$.
3. For $j = 1,\ldots,J$: let $\delta_j = \mathrm{Med}_i(r_{ij})$. Update $\hat\gamma_j \leftarrow \hat\gamma_j + \delta_j$ and $r_{ij} \leftarrow r_{ij} - \delta_j$.
4. Repeat steps 2-3 until no more changes take place.
5. Put $a = I^{-1} \sum_i \hat\alpha_i$ and $b = J^{-1} \sum_j \hat\gamma_j$, and $\hat\alpha_i \leftarrow \hat\alpha_i - a$, $\hat\gamma_j \leftarrow \hat\gamma_j - b$, $\hat\mu \leftarrow a + b$.

If $I$ or $J$ is even, the median must be understood as the “high” or “low” median
(Section 1.2), otherwise the procedure may oscillate indefinitely.

It can be shown (Problem 4.5) that the sum of absolute residuals

    $\sum_{i=1}^I \sum_{j=1}^J |y_{ij} - \hat\mu - \hat\alpha_i - \hat\gamma_j|$

decreases at each step of the algorithm. The result frequently coincides with an $L_1$
estimator, and is otherwise generally close to it. Sposito (1987) gives conditions under
which the median polish coincides with the $L_1$ estimator.
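
The five steps of median polish translate almost line by line into code. The following is a minimal sketch, assuming a complete $I \times J$ table with one observation per cell stored as a list of rows; the function name `median_polish`, the iteration cap `max_iter` and the example table are hypothetical. It uses the “low” median throughout, as the text recommends when $I$ or $J$ is even.

```python
from statistics import median_low


def median_polish(y, max_iter=100):
    """Median polish for an I x J table; returns mu, alpha, gamma, residuals."""
    I, J = len(y), len(y[0])
    r = [list(row) for row in y]               # step 1: residuals start at y
    alpha, gamma = [0.0] * I, [0.0] * J
    for _ in range(max_iter):                  # step 4: repeat steps 2-3
        changed = False
        for i in range(I):                     # step 2: sweep row medians
            d = median_low(r[i])
            if d != 0:
                changed = True
                alpha[i] += d
                r[i] = [v - d for v in r[i]]
        for j in range(J):                     # step 3: sweep column medians
            d = median_low([r[i][j] for i in range(I)])
            if d != 0:
                changed = True
                gamma[j] += d
                for i in range(I):
                    r[i][j] -= d
        if not changed:
            break
    a, b = sum(alpha) / I, sum(gamma) / J      # step 5: center the effects
    alpha = [v - a for v in alpha]
    gamma = [v - b for v in gamma]
    return a + b, alpha, gamma, r


# Hypothetical table: y_ij = 10 + alpha_i + gamma_j with alpha = (-1, 0, 1) and
# gamma = (-2, 0, 2), except that 50 has been added to the first cell.
table = [[57.0, 9.0, 11.0],
         [8.0, 10.0, 12.0],
         [9.0, 11.0, 13.0]]
mu, alpha, gamma, resid = median_polish(table)
```

In this example the additive structure is recovered exactly and the contaminated cell is left with a residual of 50, while all other residuals are zero.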


4.5.2 M-estimators with smooth y-function 


In the case of a smooth $\psi$-function, one can solve (4.37) using an iterative reweighting
method similar to that of Section 2.8. Define $W$ as in (2.31), and then with $\sigma$ replaced
by $\hat\sigma$, the M-estimator equation (4.37) for $\hat\beta$ may be written as

    $\sum_{i=1}^n w_i r_i x_i = \sum_{i=1}^n w_i x_i (y_i - x_i'\hat\beta) = 0$    (4.54)

with $w_i = W(r_i/\hat\sigma)$. These are “weighted normal equations”, and if the $w_i$ were
known, the equations could be solved by applying LS to $\sqrt{w_i}\,y_i$ and $\sqrt{w_i}\,x_i$. But the
$w_i$ are not known and depend upon the data. So the procedure, which depends on a
tolerance parameter $\varepsilon$, is

1. Compute an initial $L_1$ estimator $\hat\beta_0$ and compute $\hat\sigma$ from (4.47).
2. For $k = 0, 1, 2, \ldots$:

   (a) Given $\hat\beta_k$, for $i = 1,\ldots,n$ compute $r_{i,k} = y_i - x_i'\hat\beta_k$ and $w_{i,k} = W(r_{i,k}/\hat\sigma)$.
   (b) Compute $\hat\beta_{k+1}$ by solving

       $\sum_{i=1}^n w_{i,k}\, x_i (y_i - x_i'\hat\beta) = 0$.

3. Stop when $\max_i(|r_{i,k} - r_{i,k+1}|)/\hat\sigma < \varepsilon$.

This algorithm converges if $W(x)$ is nonincreasing for $x > 0$ (Section 9.1). If $\psi$
is monotone, since the solution is essentially unique, the choice of the starting point
influences the number of iterations but not the final result. This procedure is called
“iteratively reweighted least squares” (IRWLS).

For simultaneous estimation of $\beta$ and $\sigma$, the procedure is the same, except that at
each iteration $\hat\sigma$ is also updated as in (2.80).
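
The IRWLS steps with a preliminary scale can be sketched for a straight line with intercept as follows. Several choices here are assumptions made for illustration and are not taken from the text: the weight function is $W(t) = \min(1, k/|t|)$, corresponding to Huber's $\psi$ with $k = 1.345$; the initial $L_1$ fit is found by brute force over all lines through two data points (adequate only for small $n$, and not one of the algorithms cited above); and residuals smaller than $10^{-9}$ in absolute value are treated as null in (4.47). All function names and the data are hypothetical.

```python
from itertools import combinations
from statistics import median


def l1_line(x, y):
    """Brute-force L1 line: some L1 solution interpolates two data points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = sum(abs(yk - b0 - b1 * xk) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


def wls_line(x, y, w):
    """Solve the weighted normal equations (4.54) for fixed weights w."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    return (swxx * swy - swx * swxy) / det, (sw * swxy - swx * swy) / det


def huber_weight(t, k=1.345):
    """W(t) = psi(t)/t for Huber's psi (an assumed choice of psi)."""
    return 1.0 if abs(t) <= k else k / abs(t)


def irwls_line(x, y, eps=1e-8, max_iter=200):
    b0, b1 = l1_line(x, y)                                   # step 1
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # (4.47); residuals below 1e-9 in absolute value are treated as null
    sigma = median([abs(ri) for ri in r if abs(ri) > 1e-9]) / 0.675
    for _ in range(max_iter):                                # step 2
        w = [huber_weight(ri / sigma) for ri in r]
        b0, b1 = wls_line(x, y, w)
        r_new = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        change = max(abs(a - b) for a, b in zip(r, r_new)) / sigma
        r = r_new
        if change < eps:                                     # step 3
            break
    return b0, b1, sigma


# Hypothetical data: roughly y = 1 + 2x, with one gross outlier at x = 6.
x = [float(i) for i in range(1, 11)]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 40.0, 15.1, 16.9, 19.0, 21.1]
b0, b1, sigma = irwls_line(x, y)
```

Because the outlier here is in $y$ at a central value of $x$, a monotone $\psi$ is enough to keep the fit close to the line followed by the other nine points; the scale $\hat\sigma$ is computed once from the $L_1$ residuals and held fixed during the iterations.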


4.6 BP of monotone regression estimators 


In this section we discuss the breakdown point of monotone estimators for nonrandom
predictors. Assume $X$ is of full rank so that the estimators are well defined. Since $X$
is fixed, only $y$ can be changed, and this requires a modification of the definition of
the breakdown point (BP). The FBP for regression with fixed predictors is defined as

    $\varepsilon^* = \frac{m^*}{n}$

with

    $m^* = \max\{m \geq 0 : \hat\beta(X, y_m) \text{ bounded } \forall\, y_m \in \mathcal{Y}_m\}$,    (4.55)

where $\mathcal{Y}_m$ is the set of $n$-vectors with at least $n - m$ elements in common with $y$. It
is clear that the LS estimator has $\varepsilon^* = 0$.

Let $k^* = k^*(X)$ be the maximum number of $x_i$ lying on the same subspace of
dimension $< p$:

    $k^*(X) = \max\{\#(\beta'x_i = 0) : \beta \in R^p, \beta \neq 0\}$    (4.56)

where a subspace of dimension 0 is the set $\{0\}$. In the case of simple straight-line
regression, $k^*$ is the maximum number of repeated $x_i$. We have $k^* \geq p - 1$ always. If
$k^* = p - 1$ then $X$ is said to be in general position. In the case of a model with intercept
(4.6), $X$ is in general position iff no more than $p - 1$ of the $x_i$ lie on a hyperplane.

It is shown in Section 4.9.3 that for all regression equivariant estimators

    $\varepsilon^* \leq \varepsilon^*_{max} := \frac{m^*_{max}}{n}$    (4.57)

where

    $m^*_{max} = \left[\frac{n - k^* - 1}{2}\right] \leq \left[\frac{n - p}{2}\right]$.    (4.58)

In the location case, $k^* = 0$ and $m^*_{max}/n$ becomes (3.26). The FBP of monotone
M-estimators is given in Section 4.9.4. For the one-way design (4.9) and the two-way
design (4.10), it can be shown that the FBP of monotone M-estimators attains the
maximum (4.57) (see Section 4.9.3). In the first case

    $m^* = m^*_{max} = \left[\frac{\min_j n_j - 1}{2}\right]$    (4.59)

and so if at least half of the elements of the smallest sample are outliers then one of
the $\hat\beta_j$ is unbounded. In the second case

    $m^* = m^*_{max} = \left[\frac{\min(I, J) - 1}{2}\right]$    (4.60)

and so if at least half of the elements of a row or column are outliers then at least one
of the estimators $\hat\mu$, $\hat\alpha_i$ or $\hat\gamma_j$ breaks down. It is natural to conjecture that the FBP of
monotone M-estimators attains the maximum (4.57) for all $X$ such that $x_{ij}$ is either 0
or 1, but no general proof is known.

For designs that are not zero-one designs, the FBP of M-estimators will in general
be lower than $\varepsilon^*_{max}$. This may happen even when there are no leverage points. For
example, in the case of a uniform design $x_i = i$, $i = 1,\ldots,n$, for the fitting of a straight
line through the origin, we have $k^* = 1$ and hence $\varepsilon^*_{max} \approx 1/2$, while for large $n$ it
can be shown that $\varepsilon^* \approx 0.3$ (see Section 4.9.4). The situation is worse for fitting a
polynomial (Problem 4.7). It is even worse when there are leverage points. Consider
for instance the design

    $x_i = i$  for $i = 1,\ldots,10$,  $x_{11} = 100$.    (4.61)

Then it can be shown that $m^* = 0$ for a linear fit (Problem 4.8). The intuitive reason
for this fact is that here the estimator is determined almost solely by $y_{11}$.

As a consequence, monotone M-estimators can be recommended as initial estimators
for zero-one designs, and perhaps also for uniform designs, but not for designs
where $X$ has leverage points. The case of random $X$ will be examined in the next
chapter. The techniques discussed there will also be applicable to fixed designs with
leverage points.


4.7 Robust tests for linear hypothesis 


Regression M-estimators can be used to obtain robust approximate confidence
intervals and tests for a single linear combination of the parameters. Define
$\hat\sigma_\gamma$ as in (4.24), but with $s^2$ replaced by $\hat{v}$, as defined in (4.49). Then the tests and
intervals are of the form (4.26)-(4.27). We shall now extend the theory to inference
for several linear combinations of the $\beta_j$ represented by the vector $\gamma = A\beta$, where $A$
is a $q \times p$ matrix of rank $q$.


4.7.1 Review of the classical theory 


To simplify the exposition, it will be assumed that $X$ has full rank; that is, $p^* = p$, but
the results can be shown to hold for general $p^*$. Assume normally distributed errors
and let $\hat\gamma = A\hat\beta$, where $\hat\beta$ is the LS estimator. Then $\hat\gamma \sim N(\gamma, \Sigma_\gamma)$ where

    $\Sigma_\gamma = \sigma^2 A(X'X)^{-1}A'$.

An estimator of $\Sigma_\gamma$ is given by

    $\hat\Sigma_\gamma = s^2 A(X'X)^{-1}A'$.    (4.62)

It is proved in standard regression textbooks that $(\hat\gamma - \gamma)'\hat\Sigma_\gamma^{-1}(\hat\gamma - \gamma)/q$ has an
F-distribution with $q$ and $n - p^*$ degrees of freedom, and hence a confidence
ellipsoid for $\gamma$ of level $1 - \alpha$ is given by

    $\{\gamma : (\hat\gamma - \gamma)'\hat\Sigma_\gamma^{-1}(\hat\gamma - \gamma) \leq q F_{q,n-p^*}(1 - \alpha)\}$,

where $F_{n_1,n_2}(\delta)$ is the $\delta$-quantile of an F-distribution with $n_1$ and $n_2$ degrees of
freedom.

We consider testing the linear hypothesis $H_0 : \gamma = \gamma_0$ for a given $\gamma_0$, with level
$\alpha$. The so-called Wald-type test rejects $H_0$ when $\gamma_0$ does not belong to the confidence
ellipsoid, and hence has rejection region

    $T > F_{q,n-p^*}(1 - \alpha)$    (4.63)

with

    $T = \frac{1}{q} (\hat\gamma - \gamma_0)'\hat\Sigma_\gamma^{-1}(\hat\gamma - \gamma_0)$.    (4.64)

It is also shown in standard texts, such as Scheffé (1959), that the statistic $T$ can be
written in the form

    $T = \frac{(S_R - S)/q}{S/(n - p^*)}$    (4.65)

where

    $S = \sum_{i=1}^n r_i^2(\hat\beta)$,  $S_R = \sum_{i=1}^n r_i^2(\hat\beta_R)$

and where $\hat\beta_R$ is the LS estimator with the restriction $\gamma = A\beta = \gamma_0$. It is also shown
that the test based on (4.65) coincides with the likelihood ratio test (LRT). We can
also write the test statistic $T$ (4.65) as

    $T = \frac{S_R^* - S^*}{q}$    (4.66)

where

    $S^* = \sum_{i=1}^n \left(\frac{r_i(\hat\beta)}{s}\right)^2$,  $S_R^* = \sum_{i=1}^n \left(\frac{r_i(\hat\beta_R)}{s}\right)^2$.    (4.67)

The most common application of these tests is when $H_0$ is the hypothesis that
some of the coefficients $\beta_i$ are zero. We may assume, without loss of generality, that
the hypothesis is

    $H_0 = \{\beta_1 = \beta_2 = \ldots = \beta_q = 0\}$

which can be written as $H_0 : \gamma = A\beta = 0$ with $A = (I, 0)$, where $I$ is the $q \times q$ identity
matrix and $0$ is a $q \times (p - q)$ matrix with all its elements zero.

When $q = 1$, we have $\gamma = a'\beta$ with $a \in R^p$ and then the variance of $\hat\gamma$ is estimated by

    $\hat\sigma_\gamma^2 = s^2 a'(X'X)^{-1}a$.

In this special case the Wald test (4.64) simplifies to

    $T = \left(\frac{\hat\gamma - \gamma_0}{\hat\sigma_\gamma}\right)^2$

and is equivalent to the two-sided test in (4.27).

When the errors $u_i$ are not normal, but the conditions for the asymptotic normality
of $\hat\beta$ given at the end of Section 4.2 hold, the test and confidence regions given in this
section will still be approximately valid for large $n$. For this case, recall that if $T$ has
an $F(q, m)$ distribution then when $m \to \infty$, $qT \to_d \chi_q^2$.
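
The equivalence for $q = 1$ noted above can be checked numerically: for a straight line with intercept and the hypothesis that the slope is zero, the statistic (4.65) computed from the restricted and unrestricted sums of squares equals the squared t-statistic. The sketch below assumes this particular model and hypothesis; the function name `ls_line` and the data are hypothetical.

```python
def ls_line(x, y):
    """LS fit of y = b0 + b1*x via the 2x2 normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.9, 4.4, 4.1, 5.8, 6.1]
n, p, q = len(x), 2, 1

# Unrestricted fit and its residual sum of squares S.
b0, b1 = ls_line(x, y)
S = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Restricted fit under H0: slope = 0; the LS intercept is then the mean of y.
ybar = sum(y) / n
S_R = sum((yi - ybar) ** 2 for yi in y)

T_F = ((S_R - S) / q) / (S / (n - p))            # (4.65)

# Squared t-statistic: a = (0, 1)', so a'(X'X)^{-1}a = 1/C by (4.19).
s2 = S / (n - p)
xbar = sum(x) / n
C = sum((xi - xbar) ** 2 for xi in x)
T_W = b1 ** 2 / (s2 / C)
```

The two statistics agree because $S_R - S = \hat\beta_1^2 C$ in this model.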


4.7.2 Robust tests using M-estimators 


Let $\hat\beta$ now be an M-estimator, and let $\hat\Sigma_\beta = \hat{v}(X'X)^{-1}$ be the estimator of its covariance
matrix, with $\hat{v}$ defined as in (4.49). Let

    $\hat\gamma = A\hat\beta$,  $\hat\Sigma_\gamma = A\hat\Sigma_\beta A' = \hat{v}A(X'X)^{-1}A'$.

Then a robust “Wald-type test” is defined by the rejection region

    $\{T_W > F_{q,n-p^*}(1 - \alpha)\}$

with $T_W$ equal to the right-hand side of (4.64), but the classical quantities there are
replaced by the above robust estimators $\hat\gamma$ and $\hat\Sigma_\gamma$.

Let $\hat\beta_R$ be the M-estimator computed with the restriction that $\gamma = \gamma_0$:

    $\hat\beta_R = \arg\min_\beta \left\{ \sum_{i=1}^n \rho\left(\frac{r_i(\beta)}{\hat\sigma}\right) : A\beta = \gamma_0 \right\}$.

A “likelihood ratio-type test” (LRTT) could be defined by the region

    $\{T > F_{q,n-p^*}(1 - \alpha)\}$,

with $T$ equal to the right-hand side of (4.66), but where the residuals in (4.67)
correspond to an M-estimator $\hat\beta$. But this test would not be robust, since outliers in the
observations $y_i$ would result in corresponding residual outliers and hence an undue
influence on the test statistic.

A robust LRTT can instead be defined by the statistic

    $T_L = \sum_{i=1}^n \rho\left(\frac{r_i(\hat\beta_R)}{\hat\sigma}\right) - \sum_{i=1}^n \rho\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right)$

with a bounded $\rho$. Let

    $\xi = \frac{2E\psi'(u/\sigma)}{E\psi(u/\sigma)^2}$.

Then it can be shown (see Hampel et al., 1986) that under adequate regularity
conditions, $\xi T_L$ converges in distribution under $H_0$ to a chi-squared distribution with $q$
degrees of freedom. Since $\xi$ can be estimated by

    $\hat\xi = \frac{2\,\mathrm{ave}_i\{\psi'(r_i(\hat\beta)/\hat\sigma)\}}{\mathrm{ave}_i\{\psi(r_i(\hat\beta)/\hat\sigma)^2\}}$,

an approximate LRTT for large $n$ has rejection region

    $\hat\xi T_L > \chi_q^2(1 - \alpha)$,

where $\chi_n^2(\delta)$ denotes the $\delta$-quantile of the chi-squared distribution with $n$ degrees of
freedom.

Wald-type tests have the drawback of being based on X’X, which may affect 
the robustness of the test when there are high-leverage points. This makes LRTTs 
preferable. The influence of high-leverage points on inference is discussed further in 
Section 5.6. 


4.8 *Regression quantiles 


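Let, for $\alpha \in (0, 1)$,
$$\rho_\alpha(x) = \begin{cases} \alpha x & \text{if } x \ge 0 \\ -(1 - \alpha)x & \text{if } x < 0. \end{cases}$$
Then it is easy to show (Problem 2.13) that the solution of
$$\sum_{i=1}^n \rho_\alpha(x_i - \mu) = \min$$
is the sample $\alpha$-quantile. In the same way, the solution of
$$E\rho_\alpha(y - \mu) = \min$$
is an $\alpha$-quantile of the random variable y.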
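The characterization of the sample quantile as a minimizer of the check function can be verified numerically. The following is a minimal Python sketch for illustration only (the book's own scripts are in R); the function names `check_loss` and `quantile_by_check_loss` and the toy data are assumptions, not part of any package.

```python
def check_loss(x, alpha):
    """Check function rho_alpha: alpha*x for x >= 0, -(1 - alpha)*x for x < 0."""
    return alpha * x if x >= 0 else -(1.0 - alpha) * x


def quantile_by_check_loss(sample, alpha):
    """Minimize sum_i rho_alpha(x_i - mu) over mu.

    The objective is convex and piecewise linear in mu, so a minimizer can
    always be found among the data points; we simply evaluate each of them.
    """
    def objective(mu):
        return sum(check_loss(x - mu, alpha) for x in sample)

    return min(sample, key=objective)


# Toy data: the integers 1..10 in scrambled order (hypothetical example).
data = [2.0, 7.0, 1.0, 9.0, 4.0, 3.0, 8.0, 6.0, 5.0, 10.0]
q25 = quantile_by_check_loss(data, 0.25)
q75 = quantile_by_check_loss(data, 0.75)
print(q25, q75)  # 3.0 8.0
```

With n = 10 and α = 0.25 the slope of the objective changes sign at the third order statistic, which is why the minimizer is 3; the case α = 0.75 is symmetric.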
Koenker and Bassett (1978) extended this concept to regression, defining the regression $\alpha$-quantile as the solution $\hat\beta$ of
$$\sum_{i=1}^n \rho_\alpha(y_i - x_i'\beta) = \min. \qquad (4.68)$$
The case $\alpha = 0.5$ corresponds to the L1 estimator. Assume the model
$$y_i = x_i'\beta_\alpha + u_i,$$
where the $x_i$ are fixed and the $\alpha$-quantile of $u_i$ is zero; this is equivalent to assuming that the $\alpha$-quantile of $y_i$ is $x_i'\beta_\alpha$. Then $\hat\beta$ is an estimator of $\beta_\alpha$.

Regression quantiles are especially useful with heteroskedastic data. Assume the usual situation when the model contains a constant term. If the $u_i$ are identically distributed, then the $\beta_\alpha$ for different $\alpha$ differ only in the intercept, and hence regression quantiles do not give much useful information. But if the $u_i$ have different variability, then the $\beta_\alpha$ will also have different slopes.
If the model is correct, one would like to have for $\alpha_1 < \alpha_2$ that $x_0'\hat\beta_{\alpha_1} < x_0'\hat\beta_{\alpha_2}$ for all $x_0$ in the range of the data. But this cannot be mathematically ensured. Although a violation may be taken as an indication of model failure, it is better to ensure the ordering from the start. Methods for avoiding the "crossing" of regression quantiles have been proposed by He (1997) and Zhao (2000).

There is a very large literature on regression quantiles; see Koenker et al. (2005) 
for references. 


4.9 Appendix: Proofs and complements 


4.9.1 Why equivariance? 


In this section we want to explain why equivariance is a desirable property for a regression estimator. Let y verify the model (4.7). Here $\beta$ is the vector of model parameters. If we put for some vector $\gamma$
$$y^* = y + X\gamma, \qquad (4.69)$$
then $y^* = X(\beta + \gamma) + u$, so that $y^*$ verifies the model with parameter vector
$$\beta^* = \beta + \gamma. \qquad (4.70)$$
If $\hat\beta = \hat\beta(X, y)$ is an estimator, it would be desirable that if the data were transformed according to (4.69), the estimator would also transform according to (4.70); that is, $\hat\beta(X, y^*) = \hat\beta(X, y) + \gamma$, which corresponds to regression equivariance (4.15).
Likewise, if $X^* = XA$ for some nonsingular matrix A, then y verifies the model
$$y = (X^*A^{-1})\beta + u = X^*(A^{-1}\beta) + u,$$
which is (4.7) with X replaced by $X^*$ and $\beta$ by $A^{-1}\beta$. Again, it is desirable that estimators transform the same way; that is, $\hat\beta(X^*, y) = A^{-1}\hat\beta(X, y)$, which corresponds to affine equivariance (4.17). Scale equivariance (4.16) is dealt with in the same manner.

It must be noted that although equivariance is desirable, it must sometimes be sac- 
rificed for other properties, such as a lower prediction error. In particular, the estima- 
tors resulting from a procedure for variable selection considered in Section 5.6.2 are 
neither regression nor affine equivariant. The same thing happens in general with 
procedures for dealing with a large number of variables, such as ridge regression or 
least-angle regression (Hastie et al., 2001). 


4.9.2 Consistency of estimated slopes under asymmetric errors 
We shall first prove (4.21). Let $\alpha = Eu_i$. Then (4.5) may be rewritten as
$$y_i = \beta_0^* + \underline{x}_i'\beta_1 + u_i^*, \qquad (4.71)$$
where
$$u_i^* = u_i - \alpha, \quad \beta_0^* = \beta_0 + \alpha. \qquad (4.72)$$
Since $Eu_i^* = 0$, the LS estimator is unbiased for the parameters of (4.71), which means that $E(\hat\beta_1) = \beta_1$ and $E(\hat\beta_0) = \beta_0^*$, so that only the intercept will be biased.
We now prove (4.46) along the same lines. Let $\alpha$ be such that
$$E\psi\left(\frac{u_i - \alpha}{\sigma}\right) = 0.$$
Then reexpressing the model as (4.71)-(4.72), since $E\psi(u_i^*/\sigma) = 0$, we may apply (4.42), and hence
$$\hat\beta_0 \to_p \beta_0^*, \quad \hat\beta_1 \to_p \beta_1,$$
which implies that the estimator of the slopes is consistent, although that of the intercept may be inconsistent.


4.9.3 Maximum FBP of equivariant estimators 


The definition of the FBP in Section 4.6 can be modified to include the case of rank(X) < p. Since in this case there exists $\theta \ne 0$ such that $X\theta = 0$, (4.56) is modified as
$$k^*(X) = \max\{\#(\theta'x_i = 0) : \theta \in R^p, X\theta \ne 0\}. \qquad (4.73)$$
If rank(X) < p, there are infinitely many solutions to the equations, but all of them yield the same fit, $X\hat\beta$. We thus modify (4.55) with the requirement that the fit remains bounded:
$$m^* = \max\{m \ge 0 : X\hat\beta(X, y_m) \text{ bounded } \forall\, y_m \in \mathcal{Y}_m\}.$$


We now prove the bound (4.58). Let $m = m^*_{\max} + 1$. We have to show that $X\hat\beta(X, y)$ is unbounded for $y \in \mathcal{Y}_m$. By decomposing into the case of even and odd $n - k^*$, it follows that
$$2m \ge n - k^*. \qquad (4.74)$$
In fact, if $n - k^*$ is even, $n - k^* = 2q$, hence
$$m^*_{\max} = \left[\frac{n - k^* - 1}{2}\right] = [q - 0.5] = q - 1,$$
which implies m = q and hence $2m = n - k^*$; the other case follows in a similar way.
By the definition of $k^*$, there exists $\theta$ such that $X\theta \ne 0$ and $\theta'x_i = 0$ for a set of size $k^*$. To simplify notation, we reorder the $x_i$ so that
$$\theta'x_i = 0 \text{ for } i = 1, \ldots, k^*. \qquad (4.75)$$
Let, for some $t \in R$,
$$y_i^* = y_i + t\theta'x_i \text{ for } i = k^* + 1, \ldots, k^* + m, \qquad (4.76)$$
$$y_i^* = y_i \text{ otherwise.} \qquad (4.77)$$
Then $y^* \in \mathcal{Y}_m$. Now let $y^{**} = y^* - tX\theta$. Then $y_i^{**} = y_i$ for $1 \le i \le k^*$ by (4.75), and also for $k^* + 1 \le i \le k^* + m$ by (4.76). Then $y^{**} \in \mathcal{Y}_m$, since
$$\#\{i : y_i^{**} = y_i\} \ge k^* + m$$
and $n - (k^* + m) \le m$ by (4.74). Hence the equivariance (4.15) implies that
$$X\hat\beta(X, y^*) - X\hat\beta(X, y^{**}) = X(\hat\beta(X, y^*) - \hat\beta(X, y^* - tX\theta)) = tX\theta,$$
which is unbounded for $t \in R$, and thus both $X\hat\beta(X, y^*)$ and $X\hat\beta(X, y^{**})$ cannot be bounded.


4.9.4 The FBP of monotone M-estimators 


We now state the FBP of monotone M-estimators, which was derived by Ellis and Morgenthaler (1992) for the L1 estimator and generalized by Maronna and Yohai (2000).

Let $\psi$ be nondecreasing and bounded. Call $\Xi$ the image of X: $\Xi = \{X\theta : \theta \in R^p\}$. For each $\xi = (\xi_1, \ldots, \xi_n)' \in R^n$ let $\{i_j : j = 1, \ldots, n\} = \{i_j(\xi)\}$ be a permutation that sorts the $|\xi_i|$ in reverse order:
$$|\xi_{i_1}| \ge |\xi_{i_2}| \ge \cdots \ge |\xi_{i_n}|, \qquad (4.78)$$
and let
$$m(\xi) = \min\left\{ m : \sum_{j=1}^{m+1} |\xi_{i_j}| \ge \sum_{j=m+2}^{n} |\xi_{i_j}| \right\}. \qquad (4.79)$$
Then it is proved in Maronna and Yohai (1999) that
$$m^* = m^*(X) = \min\{m(\xi) : \xi \in \Xi, \xi \ne 0\}. \qquad (4.80)$$
Ellis and Morgenthaler (1992) give a version of this result for the L1 estimator, and use the ratio of the sums on both sides of the inequality in (4.79) as a measure of leverage.

In the location case we have $x_i \equiv 1$, hence all $\xi_i$ are equal, and the condition in (4.79) is equivalent to $m + 1 \ge n - m - 1$, which yields $m(\xi) = [(n - 1)/2]$ as in (3.26).

Consider now fitting a straight line through the origin with a uniform design $x_i = i$ (i = 1, ..., n). Then, for all $\xi \ne 0$, $|\xi_{i_j}|$ is proportional to $n - j + 1$, and hence
$$m(\xi) = \min\left\{ m : \sum_{j=1}^{m+1} (n - j + 1) \ge \sum_{j=m+2}^{n} (n - j + 1) \right\}.$$
The condition between braces is equivalent to
$$n(n + 1) \ge 2(n - m)(n - m - 1),$$
and for large n this is equivalent to $(1 - m/n)^2 \le 1/2$; that is,
$$\frac{m}{n} \ge 1 - \sqrt{1/2} \approx 0.29.$$
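The quantity m(ξ) in (4.79) is easy to evaluate directly, which gives a numerical check of both special cases just discussed. The sketch below is illustrative Python (not the book's R code); `m_of_xi` is a hypothetical helper name, and for a one-parameter model every nonzero ξ = θx is proportional to x, so m* in (4.80) reduces to a single evaluation.

```python
import math


def m_of_xi(xi):
    """Smallest m such that the m+1 largest |xi_i| sum to at least the
    sum of the remaining n-m-1 values, as in (4.79)."""
    s = sorted((abs(v) for v in xi), reverse=True)
    total = sum(s)
    head = 0.0
    for m, value in enumerate(s):
        head += value  # head is now the sum of the m+1 largest values
        if head >= total - head:
            return m
    return len(s) - 1


# Location case: all xi_i equal, so m = [(n - 1)/2].
m_location = m_of_xi([1.0] * 11)

# Line through the origin with uniform design x_i = i.
n = 1000
m_uniform = m_of_xi([float(i) for i in range(1, n + 1)])
fbp_uniform = m_uniform / n
print(m_location, m_uniform, fbp_uniform, 1 - math.sqrt(0.5))
```

For n = 1000 the exact condition n(n + 1) ≥ 2(n − m)(n − m − 1) first holds at m = 293, in line with the large-n approximation 1 − √(1/2) ≈ 0.293.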

The case of a general straight line is dealt with in a similar way. The proof 
of (4.59) is not difficult, but that of (4.60) is rather involved (see Maronna and 
Yohai, 1999). 

If X is uniformly distributed on a p-dimensional spherical surface, it can be proved that $\varepsilon^* \approx \sqrt{0.5/p}$ for large p (Maronna et al., 1979), showing that even a fixed design without leverage points may yield a low BP if p is large.


4.10 Recommendations and software 


For linear regression with fixed predictors without leverage points we recommend the bisquare M-estimator starting from L1 (Section 4.4.2), which can be computed using lmrobM (RobStatTM). lmrobLinTest (RobStatTM) performs the robust likelihood ratio-type test for linear hypotheses described in Section 4.7.2.


4.11 Problems 


4.1. Let $\hat\beta$ be a solution of (4.39) with fixed $\sigma$. Show that:
(a) if $y_i$ is replaced by $y_i + x_i'\gamma$, then $\hat\beta + \gamma$ is a solution
(b) if $x_i$ is replaced by $Ax_i$, then $A'^{-1}\hat\beta$ is a solution.

4.2. Let $\hat\beta$ be a solution of (4.39) where $\hat\sigma$ verifies (4.48). Then $\hat\beta$ is regression, affine and scale equivariant.

4.3. Show that the solution $\hat\beta$ of (4.37) is the LS estimator of the regression of $y_i^*$ on $x_i$, where $y_i^* = \xi(y_i, x_i'\hat\beta, \hat\sigma)$, with $\xi$ being "pseudo-observations", defined as in (2.95). Use this fact to define an iterative procedure to compute a regression M-estimator.

4.4. Show that the L1 estimator for the model of regression through the origin $y_i = \beta x_i + u_i$ is the median of $z_i = y_i/x_i$, where $z_i$ has probability proportional to $|x_i|$.

4.5. Verify (4.53) and show that, at each step of the median polish algorithm, the sum $\sum_i \sum_j |y_{ij} - \mu - \alpha_i - \gamma_j|$ does not increase.

4.6. Write computer code for the median polish algorithm and apply it to the original and modified oats data of Example 4.2 and to the data of Problem 4.9.

4.7. Show that, for large n, the FBP given by (4.80) for fitting $y_i = \beta x_i^2 + u_i$ with a uniform design of n points is approximately $1 - 0.5^{1/3}$.

4.8. Show that for the fit of $y_i = \beta x_i + u_i$ with design (4.61), the FBP given by (4.80) is zero.

4.9. Table 4.5 (Roberts and Cohrssen, 1968) gives prevalence rates in percentage terms for men aged 55-64 with hearing levels 16 dB or more above the audiometric zero, at different frequencies (hertz) and for normal speech. The columns classify the data into seven occupational groups: professional-managerial, farm, clerical sales, craftsmen, operatives, service, and laborers. (The dataset is called hearing.) Fit an additive ANOVA model by LS and robustly. Compare the effect estimates. These data have also been analyzed by Daniel (1978).

Table 4.5 Hearing data

                         Occupation
Frequency    I      II     III    IV     V      VI     VII
500          2.1    6.8    8.4    1.4   14.6    7.9    4.8
1000         1.7    8.1    8.4    1.4   12.0    3.7    4.5
2000        14.4   14.8   27.0   30.9   36.5   36.4   31.4
3000        57.4   62.4   37.4   63.3   65.5   65.6   59.8
4000        66.2   81.7   53.3   80.7   79.7   80.8   82.4
6000        75.2   94.0   74.3   87.9   93.3   87.8   80.5
Normal       4.1   10.2   10.7    3.5   18.1   11.4    6.1

I, professional-managerial; II, farm; III, clerical sales; IV, craftsmen; V, operatives; VI, service; VII, laborers.
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5.1 Introduction 


Chapter 4 concentrated on robust regression estimators for situations where the predictor matrix X contains no rows $x_i$ with high leverage, and only the responses y may contain outliers. In that case a monotone M-estimator is a reliable starting point for computing a robust scale estimator and a redescending M-estimator. But when X is random, outliers in X operate as leverage points, and may completely distort the value of a monotone M-estimator when some pairs $(x_i, y_i)$ are atypical. This chapter will deal with the case of random predictors, and one of its focuses is on how to obtain good initial values for redescending M-estimators.

The following example shows the failure of a monotone M-estimator when X is 
random and there is a single atypical observation. 


Example 5.1 Smith et al. (1984) measured the contents (in parts per million) of 22 chemical elements in 53 samples of rocks in Western Australia. Tables and figures for this example can be obtained with script mineral.R.

Figure 5.1 plots the zinc (Zn) and the copper (Cu) contents against each other. Observation 15 stands out as clearly atypical. The LS fit is seen to be influenced more by this observation than by the rest. However, the L1 fit exhibits the same drawback. Neither the LS nor the L1 fit represents the bulk of the data, since both are "attracted" by observation 15, which has a very large abscissa and too high an ordinate. By contrast, the LS fit omitting observation 15 gives a good fit to the rest of the data. Figures 5.2 and 5.3 show the Q-Q plot and the plot of residuals vs fitted values for


Robust Statistics: Theory and Methods (with R), Second Edition. 

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matías Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd. 
Companion website: www.wiley.com/go/maronna/robust

Figure 5.1 Mineral data: fits with LS, L1, and LS without observation 15

Figure 5.2 Mineral data: Q-Q plot of LS residuals

Figure 5.3 Mineral data: LS residuals versus fit

Table 5.1 Regression coefficients for mineral data

             LS       L1       LS(-15)   Robust
Intercept    7.960    10.412   15.491    12.913
Slope        0.134    0.080    0.030     0.044

the LS estimator. Neither figure reveals the existence of an outlier, as indicated by an 
exceptionally large residual. However, the second figure shows an approximately lin- 
ear relationship between residuals and fitted values — except for the point with largest 
fitted value — and this indicates that the fit is not correct. 

Table 5.1 gives the estimated parameters for the LS and L1 fits, as well as for the LS fit computed without observation 15, and for a redescending regression M-estimator to be described shortly.

The intuitive reason for the failure of the L1 estimator (and of monotone M-estimators in general) in this situation is that the $x_i$ outlier dominates the solution to (4.40) in the following sense. If, for some i, $x_i$ is "much larger than the rest", then in order to make the sum zero, the residual $y_i - x_i'\hat\beta$ must be near zero and hence $\hat\beta$ is essentially determined by $(x_i, y_i)$. This does not happen with the redescending M-estimator.
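The domination effect can be seen in the simplest setting, regression through the origin, where the L1 estimator is the median of the ratios $y_i/x_i$ weighted by $|x_i|$ (Problem 4.4). The following Python sketch is illustrative only; the data are made up, and `l1_slope_through_origin` is a hypothetical helper, not a function of RobStatTM.

```python
def l1_slope_through_origin(x, y):
    """L1 fit of y_i = beta*x_i + u_i: weighted median of y_i/x_i
    with weights |x_i| (the characterization stated in Problem 4.4)."""
    pairs = sorted((yi / xi, abs(xi)) for xi, yi in zip(x, y) if xi != 0)
    half = 0.5 * sum(w for _, w in pairs)
    acc = 0.0
    for z, w in pairs:
        acc += w
        if acc >= half:
            return z
    raise ValueError("need at least one nonzero x")


# Clean data scattered around the line y = 2x (made-up values).
x = [float(i) for i in range(1, 11)]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.05, 0.2, -0.15, 0.1, -0.1]
y = [2.0 * xi + e for xi, e in zip(x, noise)]
slope_clean = l1_slope_through_origin(x, y)

# Add a single leverage point far to the right with an atypical response.
slope_contaminated = l1_slope_through_origin(x + [1000.0], y + [0.0])
print(slope_clean, slope_contaminated)
```

The added point carries weight 1000 against a total of 55 for the other ten observations, so the weighted median is its own ratio y/x = 0: the fit passes exactly through the leverage point, whatever the remaining data say.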


118 LINEAR REGRESSION 2 


5.2. The linear model with random predictors 


Situations like the one in the previous example occur primarily when the $x_i$ are not fixed, as in designed experiments, but instead are random variables observed together with $y_i$. We now briefly discuss the properties of a linear model with random X. Our observations are now the i.i.d. (p + 1)-dimensional random vectors $(x_i, y_i)$ (i = 1, ..., n) satisfying the linear model relation
$$y_i = x_i'\beta + u_i. \qquad (5.1)$$
In the case of fixed X we assumed that the distribution of $u_i$ does not depend on $x_i$. The analogous assumption here is that

the $u_i$ are i.i.d. and independent of the $x_i$. (5.2)

The analogue of assuming X is of full rank is to assume that the distribution of x is not concentrated on any subspace; that is, $P(a'x = 0) < 1$ for all $a \ne 0$. This condition implies that the probability that X has full rank tends to 1 when $n \to \infty$, and holds in particular if the distribution of x has a density. Then the LS estimator is well defined, and (4.18) holds conditionally on X:
$$E(\hat\beta_{LS} \mid X) = \beta, \quad \mathrm{Var}(\hat\beta_{LS} \mid X) = \sigma^2 (X'X)^{-1},$$
where $\sigma^2 = \mathrm{Var}(u)$.
Also (4.23) holds conditionally: if the $u_i$ are normal then the conditional distribution of $\hat\beta_{LS}$ given X is multivariate normal. If the $u_i$ are not normal, assume that
$$V_x = Exx' \qquad (5.3)$$
exists. It can be shown that
$$\mathcal{D}(\hat\beta_{LS}) \approx N_p\left(\beta, \frac{C_{\hat\beta}}{n}\right), \qquad (5.4)$$
where
$$C_{\hat\beta} = \sigma^2 V_x^{-1} \qquad (5.5)$$
is the asymptotic covariance matrix of $\hat\beta$; see Section 10.10.2. The estimation of $C_{\hat\beta}$ is discussed in Section 5.6.

In the case (4.5) where the model has an intercept term, it follows from (5.5) that the asymptotic covariance matrix of $(\hat\beta_0, \hat\beta_1)$ is
$$\sigma^2 \begin{pmatrix} 1 + \mu_x' C_x^{-1} \mu_x & -\mu_x' C_x^{-1} \\ -C_x^{-1}\mu_x & C_x^{-1} \end{pmatrix}, \qquad (5.6)$$
where
$$\mu_x = E\underline{x}, \quad C_x = \mathrm{Var}(\underline{x}).$$

5.3 M-estimators with a bounded ρ-function


Our approach to robust regression estimators where both the $x_i$ and the $y_i$ may contain outliers is to use an M-estimator $\hat\beta$ defined by
$$\sum_{i=1}^n \rho\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) = \min \qquad (5.7)$$
with a bounded $\rho$ and a high BP preliminary scale $\hat\sigma$. The scale $\hat\sigma$ will be required to fulfill certain requirements discussed in Section 5.5. If $\rho$ has a derivative $\psi$ it follows that $\hat\beta$ solves
$$\sum_{i=1}^n \psi\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) x_i = 0, \qquad (5.8)$$
where $\psi$ is redescending (it is easy to verify that a function $\rho$ with a monotonic derivative $\psi$ cannot be bounded). Consequently, the estimating equation (5.8) may have multiple solutions corresponding to multiple local minima of the function on the left-hand side of (5.7), and generally only one of them (the "good solution") corresponds to the global minimizer $\hat\beta$ defined by (5.7). We shall see that $\rho$ and $\hat\sigma$ may be chosen in order to attain both a high BP and a high efficiency.

In Section 5.5 we describe a particular computing method for approximating $\hat\beta$ as defined by (5.7). The method is called an MM-estimator, and as a demonstration of its
use we apply it to the data of Example 5.1. The results, displayed in Figure 5.4, show 
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Figure 5.4 Mineral data: fits by MM estimator (“ROB”) and by LS without the 
outlier 
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Figure 5.5 Mineral data: residuals versus fitted values of MM-estimator 


that the MM-estimator almost coincides with the LS estimator computed with data 
point 15 deleted. The MM-estimator intercept and slope parameters are now 14.05 
and 0.02, respectively, as compared to 7.96 and 0.13 for the LS estimator (recall 
Table 5.1). 

Figure 5.5 shows the residuals plotted against fitted values and Figure 5.6 shows 
the Q-Q plot of the residuals. The former now lacks the suspicious structure of 
Figure 5.3 and point 15 is now revealed as a large outlier in the residuals as well 
as the fit, with a considerably reduced value of fit (roughly 40 instead of more than 
90). Moreover, compared to Figure 5.2 the Q—Q plot now clearly reveals point 15 
as an outlier. Figure 5.7 compares the sorted absolute values of residuals from the 
MM-estimator fit and the LS fit, with point 15 omitted for reasons of scale. It is seen 
that most points lie below the identity diagonal, showing that, except for the outlier, 
the sorted absolute MM-residuals are smaller than those from the LS estimator, and 
hence the MM-estimator fits the data better. 


5.3.1 Properties of M-estimators with a bounded p-function 


If $\hat\sigma$ is regression and affine equivariant, as defined in (4.48), then the estimator $\hat\beta$ defined by (5.7) is regression, scale and affine equivariant. We now discuss the breakdown point, influence function and asymptotic normality of such estimators.
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Figure 5.6 Mineral data: Q—Q plot of robust residuals 
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Figure 5.7 Mineral data: sorted absolute values of robust versus LS residuals (point 
15 omitted) 




5.3.1.1 Breakdown point 


We focus on the finite breakdown point (FBP) of $\hat\beta$. Since the $x_i$ are now random, we are in the situation of Section 3.2.5. Let $z_i = (x_i, y_i)$ and write the estimator as $\hat\beta(Z)$ with $Z = \{z_1, \ldots, z_n\}$. Then, instead of (4.55), define $\varepsilon^* = m^*/n$, where
$$m^* = \max\{m \ge 0 : \hat\beta(Z_m) \text{ bounded } \forall\, Z_m \in \mathcal{Z}_m\}, \qquad (5.9)$$
and $\mathcal{Z}_m$ is the set of datasets with at least n − m elements in common with Z. Note that since not only y but also X are variable here, the FBP given by (5.9) is less than or equal to that given earlier by (4.55).

It is then easy to show that the FBP of monotone M-estimators is zero (Section 5.13.1). Intuitively, this is due to the fact that a term with a "large" $x_i$ "dominates" the sum in (5.7). Then the scale used in Section 4.4.2, which is based on the residuals from the L1 estimator, also has a zero BP.

On the other hand, it can be shown that the maximum FBP of any regression equivariant estimator is again the one given in Section 4.6:
$$\varepsilon^* \le \varepsilon^*_{\max} =: \frac{1}{n}\left[\frac{n - k^* - 1}{2}\right] \le \frac{1}{n}\left[\frac{n - p}{2}\right], \qquad (5.10)$$
with $k^*$ as in (4.56):
$$k^*(X) = \max\{\#(\theta'x_i = 0) : \theta \in R^p, \theta \ne 0\}. \qquad (5.11)$$
The proof is similar to that of Section 4.9.3, and we shall see that this bound is attained by several types of estimator to be defined in this chapter. It can be shown in the same way that the maximum asymptotic BP for regression equivariant estimators is $(1 - \alpha)/2$, where
$$\alpha = \max_{\theta \ne 0} P(\theta'x = 0). \qquad (5.12)$$
In the previous chapter, our method for developing robust estimators was to generalize the MLE, which leads to M-estimators with unbounded $\rho$. In the present setting, calculating the MLE again yields (4.36); in particular, LS is the MLE for normal u, for any x. Thus no new class of estimators emerges from the ML approach.


5.3.1.2 Influence function 
If the joint distribution F of (x, y) is given by the model (5.1)-(5.2), then it follows from (3.48) that the influence function (IF) of an M-estimator with known $\sigma$ under the model is
$$\mathrm{IF}((x_0, y_0), F) = \frac{\sigma}{b}\,\psi\left(\frac{y_0 - x_0'\beta}{\sigma}\right) V_x^{-1} x_0 \quad \text{with } b = E\psi'\left(\frac{u}{\sigma}\right) \qquad (5.13)$$
and with $V_x$ defined by (5.3). The proof is similar to that of Section 3.8.1. It follows that the IF is unbounded. However, the IFs for the cases of monotone and of redescending $\psi$ are rather different. If $\psi$ is monotone, then the IF tends to infinity for any fixed $x_0$ if $y_0$ tends to infinity. If $\psi$ is redescending and is such that $\psi(x) = 0$ for $|x| \ge k$, then the IF will tend to infinity only when $x_0$ tends to infinity and $|y_0 - x_0'\beta|/\sigma \le k$, which means that large outliers have no influence on the estimator.

When $\sigma$ is unknown and is estimated by $\hat\sigma$, it can be shown that if the distribution of $u_i$ is symmetric, then (5.13) also holds, with $\sigma$ replaced by the asymptotic value of $\hat\sigma$.

The fact that the IF is unbounded does not necessarily imply that the bias is unbounded for any positive contamination rate $\varepsilon$. In fact, while a monotone $\psi$ implies BP = 0, we shall see in Section 5.5 that with a bounded $\rho$ it is possible to attain a high BP, and hence that the bias is bounded for large values of $\varepsilon$. On the other hand, in Section 5.11.1 we shall define a family of estimators with bounded IF, but such that their BP may be very low for large p. These facts indicate that the IF need not yield a reliable approximation to the bias.


5.3.1.3 Asymptotic normality 


Assume that the model (5.1)-(5.2) holds, that x has finite variances, and that $\hat\sigma$ converges in probability to some $\sigma$. Then it can be proved under rather general conditions (see Section 10.10.2 for details) that the estimator $\hat\beta$ defined by (5.7) is consistent and asymptotically normal. More precisely,
$$\sqrt{n}(\hat\beta - \beta) \to_d N_p(0, vV_x^{-1}), \qquad (5.14)$$
where $V_x = Exx'$, and v is as in (4.44):
$$v = \sigma^2 \frac{E\psi(u/\sigma)^2}{(E\psi'(u/\sigma))^2}. \qquad (5.15)$$
This result implies that as long as x has finite variances, the efficiency of $\hat\beta$ does not depend on the distribution of x.
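Because the efficiency depends only on v in (5.15), it can be computed by one-dimensional numerical integration. The Python sketch below is an illustration under stated assumptions: errors are standard normal (σ = 1, so LS has v = 1 and the efficiency is 1/v), the bisquare ψ is taken in the form ψ(x) = x(1 − (x/k)²)² for |x| ≤ k, and Simpson's rule is used as a simple quadrature. The helper names are hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def bisquare_efficiency(k):
    """Normal efficiency 1/v with v = E psi(u)^2 / (E psi'(u))^2, sigma = 1."""
    def psi(x):
        return x * (1.0 - (x / k) ** 2) ** 2

    def dpsi(x):
        u = (x / k) ** 2
        return (1.0 - u) * (1.0 - 5.0 * u)

    # psi and psi' vanish outside [-k, k], so integrating there is enough.
    e_psi2 = simpson(lambda z: psi(z) ** 2 * normal_pdf(z), -k, k)
    e_dpsi = simpson(lambda z: dpsi(z) * normal_pdf(z), -k, k)
    return e_dpsi ** 2 / e_psi2


eff_95 = bisquare_efficiency(4.685)
eff_85 = bisquare_efficiency(3.44)
print(round(eff_95, 3), round(eff_85, 3))
```

The tuning constants 4.685 and 3.44 are the values commonly quoted for 95% and 85% normal efficiency of the bisquare; they are used here only as test inputs.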

The Fisher-consistency (Section 3.5.3) of M-estimators with random predictors 
is shown in general in Section 10.11. 

We have seen in Chapter 4 that a leverage point forces the fit of a monotone M-estimator to pass near the point, and this has a double-edged effect: if the point is a "typical" observation, the fit improves (although the normal approximation to the distribution of the estimator deteriorates); if it is "atypical", the overall fit worsens. The implications of these facts for the case of random x are as follows. Suppose that x is heavy tailed so that its variances do not exist. If the model (5.1)-(5.2) holds, then the normal ceases to be a good approximation to the distribution of $\hat\beta$, but at the same time $\hat\beta$ is "closer" to $\beta$ than in the case of "typical" x (see Section 5.13.2 for details). But if the model does not hold, then $\hat\beta$ may have a higher bias than in the case of "typical" x.
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5.4 Estimators based on a robust residual scale 


The M-estimators defined in (5.7) require a robust scale estimator $\hat\sigma$, which typically is computed using residuals from a robust regression estimator. In this section, we shall present a family of regression estimators that do not depend on a preliminary residual scale, and thus break this cycle. They can be used to compute the preliminary scale $\hat\sigma$ in (5.7).

Note that the LS and the L1 estimators minimize the averages of the squared and of
the absolute residuals respectively, and therefore they minimize measures of residual 
largeness that can be seriously influenced by even a single residual outlier. A more 
robust alternative is to minimize a scale measure of residuals that is insensitive to 
large values, and one such possibility is the median of the absolute residuals. This is 
the basis of the least median of squares (LMS) estimator, introduced as the first esti- 
mator of this kind by Hampel (1975) and by Rousseeuw (1984) who also proposed 
a computational algorithm. In the location case, the LMS estimator is equivalent to 
the Shorth estimator, defined as the mid-point of the shortest half of the data (see 
Problem 2.16a). For fitting a linear model, the LMS estimator has the intuitive prop- 
erty of generating the strip of minimum width that contains half of the observations 
(Problem 5.9). 
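In the location case the equivalence with the Shorth makes the LMS estimator simple to compute by scanning contiguous windows of the sorted data. The Python sketch below is illustrative; taking a "half" to contain h = [n/2] + 1 points is an assumption (conventions differ slightly), and `shorth_midpoint` is a hypothetical name.

```python
def shorth_midpoint(values):
    """Midpoint and half-length of the shortest interval containing
    h = n//2 + 1 of the observations (the size of a 'half' is an assumption)."""
    s = sorted(values)
    n = len(s)
    h = n // 2 + 1
    start = min(range(n - h + 1), key=lambda i: s[i + h - 1] - s[i])
    lo, hi = s[start], s[start + h - 1]
    # The half-length is the h-th smallest absolute residual at the midpoint.
    return 0.5 * (lo + hi), 0.5 * (hi - lo)


# Five clustered values and three gross outliers (made-up data).
data = [1.0, 1.1, 1.2, 1.3, 1.4, 10.0, 20.0, 30.0]
center, half_length = shorth_midpoint(data)
mean = sum(data) / len(data)
print(center, half_length, mean)
```

Here the estimate stays at the center of the cluster (about 1.2), while the sample mean is pulled above 8 by the three outliers.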

Let $\hat\sigma = \hat\sigma(r)$ be a location-invariant and scale-equivariant robust scale estimator based on a vector of residuals
$$r(\beta) = (r_1(\beta), \ldots, r_n(\beta)). \qquad (5.16)$$
Then a regression estimator can be defined as
$$\hat\beta = \arg\min_{\beta} \hat\sigma(r(\beta)). \qquad (5.17)$$
Such estimators are regression, scale, and affine equivariant (Problem 5.1).


5.4.1 S-estimators 


A very important case of (5.17) is when $\hat\sigma(r)$ is a scale M-estimator defined for each r as the solution to
$$\frac{1}{n}\sum_{i=1}^n \rho\left(\frac{r_i}{\hat\sigma}\right) = \delta, \qquad (5.18)$$
where $\rho$ is a bounded $\rho$-function. By (3.23), the asymptotic BP of $\hat\sigma$ is $\min(\delta, 1 - \delta)$. The resulting estimator (5.17) is called an S-estimator (Rousseeuw and Yohai, 1984). See Section 5.9.4 for the choice of an initial estimator $\hat\beta_0$ for the MM-estimator.
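The M-scale equation (5.18) has no closed-form solution, but for a fixed residual vector it can be solved by a simple fixed-point iteration. The Python sketch below is one common scheme, not necessarily the one used in RobStatTM; it assumes the bisquare ρ scaled so that its maximum is 1, δ = 0.5, and the constant c = 1.547 (approximately the value that makes the scale consistent at the normal for δ = 0.5). The names `rho_bisquare` and `m_scale` are hypothetical.

```python
import statistics


def rho_bisquare(t, c=1.547):
    """Bounded bisquare rho with maximum 1: 1 - (1 - (t/c)^2)^3 for |t| <= c."""
    u = (t / c) ** 2
    return 1.0 if u >= 1.0 else 1.0 - (1.0 - u) ** 3


def mean_rho(residuals, sigma, c=1.547):
    return sum(rho_bisquare(r / sigma, c) for r in residuals) / len(residuals)


def m_scale(residuals, delta=0.5, c=1.547, tol=1e-12, max_iter=1000):
    """Solve (1/n) sum rho(r_i / sigma) = delta by the iteration
    sigma <- sigma * sqrt(mean_rho(sigma) / delta), started at the
    normalized median of the absolute residuals."""
    sigma = statistics.median(abs(r) for r in residuals) / 0.6745
    if sigma == 0.0:
        return 0.0
    for _ in range(max_iter):
        new_sigma = sigma * (mean_rho(residuals, sigma, c) / delta) ** 0.5
        if abs(new_sigma - sigma) <= tol * sigma:
            return new_sigma
        sigma = new_sigma
    return sigma


# Made-up residuals; the second vector replaces the largest one by a gross outlier.
r = [-1.2, 0.3, 0.8, -0.5, 2.1, -0.1, 0.6, -1.7, 0.4, 1.0]
r_out = [-1.2, 0.3, 0.8, -0.5, 1e6, -0.1, 0.6, -1.7, 0.4, 1.0]
s = m_scale(r)
s_out = m_scale(r_out)
print(s, s_out, mean_rho(r, s))
```

An S-estimator would minimize `m_scale` of the residual vector r(β) over β, as in (5.17); that outer minimization is the hard part and is the subject of Section 5.7.1.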
We now consider the BP of S-estimators. Proofs of all results on the BP are given in Section 5.13.4. The maximum FBP of an S-estimator with a bounded $\rho$-function is
$$\varepsilon^*_{\max} = \frac{m^*_{\max}}{n}, \qquad (5.19)$$
where $m^*_{\max}$ is the same as in (4.58), namely
$$m^*_{\max} = \left[\frac{n - k^* - 1}{2}\right], \qquad (5.20)$$
where $k^*$ is defined in (5.11). Hence $\varepsilon^*_{\max}$ coincides with the maximum BP for equivariant estimators given in (5.10). This BP is attained by taking any $\delta$ of the form
$$\delta = \frac{m^*_{\max} + \gamma}{n} \quad \text{with } \gamma \in (0, 1). \qquad (5.21)$$
Recall that $k^* \ge p - 1$, and if $k^* = p - 1$ we say that X is in general position. When X is in general position, the maximum FBP is
$$\varepsilon^*_{\max} = \frac{1}{n}\left[\frac{n - p}{2}\right],$$
which is approximately 0.5 for large n. Similarly, the maximum asymptotic BP of a regression S-estimator with a bounded $\rho$ is
$$\varepsilon^* = \frac{1 - \alpha}{2}, \qquad (5.22)$$
with $\alpha$ defined in (5.12), and thus coincides with the maximum asymptotic BP for equivariant estimators given in Section 5.3.1.1. This maximum is attained by taking $\delta = (1 - \alpha)/2$. If x has a density then $\alpha = 0$, and hence $\delta = 0.5$ yields $\varepsilon^* = 0.5$.

Since the median of absolute values is a scale M-estimator, the LMS estimator may be written as the estimator minimizing the scale $\hat\sigma$ given by (5.18), with $\rho(t) = I(|t| > 1)$ and $\delta = 0.5$. For a general $\delta$, a solution $\hat\sigma$ of (5.18) is the hth order statistic of $|r_i|$, with $h = n - [n\delta]$ (Problem 2.14). The regression estimator defined by minimizing $\hat\sigma$ is called the least $\alpha$-quantile estimator, with $\alpha = h/n$. Although it has a discontinuous $\rho$-function, the proof of the preceding results (5.20)-(5.21) can be shown to imply that the maximum BP is again (5.19) and that it can be attained by choosing
$$h = n - m^*_{\max} = \left[\frac{n + k^* + 2}{2}\right], \qquad (5.23)$$
which is slightly larger than n/2. See the end of Section 5.13.4.

We deal now with the efficiency of S-estimators. Since an S-estimator $\hat\beta$ minimizes $\hat\sigma = \hat\sigma(r(\beta))$ it follows that $\hat\beta$ is also an M-estimator (5.7) in that
$$\sum_{i=1}^n \rho\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right) \le \sum_{i=1}^n \rho\left(\frac{r_i(\tilde\beta)}{\hat\sigma}\right) \quad \text{for all } \tilde\beta, \qquad (5.24)$$
where $\hat\sigma = \hat\sigma(r(\hat\beta))$ is the same in the denominator on both sides of the inequality. To see that this is indeed the case, suppose that for some $\tilde\beta$ we had
$$\sum_{i=1}^n \rho\left(\frac{r_i(\tilde\beta)}{\hat\sigma}\right) < \sum_{i=1}^n \rho\left(\frac{r_i(\hat\beta)}{\hat\sigma}\right).$$
Since the right-hand side equals $n\delta$, by the monotonicity of $\rho$ there would exist $\tilde\sigma < \hat\sigma$ such that
$$\sum_{i=1}^n \rho\left(\frac{r_i(\tilde\beta)}{\tilde\sigma}\right) = n\delta,$$
which would contradict the fact that $\hat\sigma$ is the minimum M-scale. If $\rho$ has a derivative $\psi$, it follows that $\hat\beta$ is also an M-estimator in the sense of (5.8), but with the condition that the scale $\hat\sigma = \hat\sigma(r(\hat\beta))$ is estimated simultaneously with $\hat\beta$.

Because S-estimators are M-estimators, it follows that the asymptotic distribution of an S-estimator with a smooth $\rho$ under the model (5.1)-(5.2) is given by (4.43)-(4.44); see Davies (1990) and Kim and Pollard (1990) for a rigorous proof. For the LMS estimator, which has a discontinuous $\rho$, Davies (1990) shows that $\hat\beta - \beta$ has a slow convergence rate of $n^{-1/3}$, while estimators based on a smooth $\rho$-function have the usual convergence rate $n^{-1/2}$. Thus the LMS estimator is highly inefficient for large n.

Unfortunately S-estimators with a smooth $\rho$ cannot simultaneously have high BP and high efficiency. In particular, it was shown by Hössjer (1992) that an S-estimator with BP = 0.5 has an asymptotic efficiency under normally distributed errors that is not larger than 0.33. In fact, numerical computation shows that, for normal distributions, the efficiency of S-estimators based on the bisquare $\rho$-function is 0.29, which is adequately close to the upper bound.

Since an S-estimator with a differentiable $\rho$-function satisfies (5.8), its IF is given by (5.13) and hence is unbounded. See, however, the comments on p. 132. Note also that S-estimators are "redescending" in the sense that if some of the $y_i$ are "too large", the estimator is completely unaffected by these observations, and coincides with an M-estimator computed after deleting such outliers. A precise statement is given in Problem 5.10.

Algorithms to compute S-estimators are discussed in Section 5.7.1. 


5.4.2 L-estimators of scale and the LTS estimator 


An alternative to using an M-scale is to use an L-estimator of scale. Call $|r|_{(1)} \le \cdots \le |r|_{(n)}$ the ordered absolute values of residuals. Then we can define scale estimators as linear combinations of the $|r|_{(i)}$ in one of the two following forms:
$$\hat\sigma = \sum_{i=1}^n a_i |r|_{(i)} \quad \text{or} \quad \hat\sigma = \left(\sum_{i=1}^n a_i |r|_{(i)}^2\right)^{1/2},$$
where the $a_i$ are nonnegative constants.
A particular version of the second form is the $\alpha$-trimmed squares scale where $\alpha \in (0, 1)$, and the $n - h = [n\alpha]$ largest absolute residuals are trimmed:
$$\hat\sigma = \left(\sum_{i=1}^h |r|_{(i)}^2\right)^{1/2}. \qquad (5.25)$$
The corresponding regression estimator is called the least trimmed squares (LTS) estimator (Rousseeuw, 1984). The FBP of the LTS estimator depends on h in the same way as that of the LMS estimator, so that for the LTS estimator to attain the maximum BP one must choose h in (5.25) as in (5.23). In particular, when X is in general position one must choose
$$h = n - \left[\frac{n - p}{2}\right] = \left[\frac{n + p + 1}{2}\right],$$

which is approximately n/2 for large n. The asymptotic behavior of the LTS estimator is more complicated than that of smooth S-estimators. However, it is known that it has the standard convergence rate of $n^{-1/2}$, and it can be shown that the asymptotic efficiency of the LTS estimator for the normal distribution has the exceedingly low value of about 7%; see Rousseeuw and Leroy (1987, p. 180).


5.4.3 t—estimators 


In order to improve the efficiency of regression estimators based on scale estimators 
Yohai and Zamar (1988) proposed a different scale estimator to be used in (5.17). 
As before, given a vector of residuals r = (r,,...,7,,), let G(r) be a robust M-scale 


satisfying 
1 n I; 
= x)=6, 5.26 
: 2 Po ( z ) (5.26) 


where pp is a bounded p-function, tuned to obtain the desired BP. Now use another 
bounded p-function p, to define the r-scale as 


i ee VS (a 
rr) = (ry 3 de ( a), (5.27) 


where the constant 6, satisfies Ep,(Z) = 6,, with Z ~ N(0, 1), which ensures consis- 
tency for Gaussian errors. The t-regression estimator is defined by 


n 


B = arg mn tT(r(B)). (5.28) 


Note that although the above definition is in line with (5.17), an important difference
is that the $\rho$-function $\rho_1$ in (5.27) can be tuned separately from $\rho_0$ in (5.26) in order to
improve the efficiency of the resulting regression estimator. An intuitive motivation
is that the LS estimator is a special case of (5.28) when $\rho_1(r) = r^2$, and hence by
an adequate choice of $\rho_1$ the estimator can be made arbitrarily close to the LS
estimator, which yields an arbitrarily high efficiency at the normal distribution.
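
To make the two-stage definition concrete, here is a minimal Python sketch of a $\tau$-scale computed from an M-scale. It is an illustration under stated assumptions rather than the authors' implementation: both $\rho_0$ and $\rho_1$ are taken to be bisquare functions scaled to have maximum 1, with $c_0 = 1.56$ and $\delta = 0.5$ for the M-scale, while $c_1 = 6.0$ is an arbitrary illustrative tuning value. The M-scale equation is solved by bisection, and $\delta_1 = \mathrm{E}\rho_1(Z)$ is obtained by a simple midpoint rule. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def rho_bisquare(t):
    """Bisquare rho-function, scaled so that rho(t) = 1 for |t| >= 1."""
    return 1.0 if abs(t) >= 1.0 else 1.0 - (1.0 - t * t) ** 3

def m_scale(res, c=1.56, delta=0.5):
    """Solve (1/n) sum rho(r_i / (c s)) = delta for s by bisection."""
    n = len(res)
    if sum(1 for r in res if r != 0.0) <= n * delta:
        return 0.0  # too many exact zeros: the M-scale collapses to zero
    def mean_rho(s):
        return sum(rho_bisquare(r / (c * s)) for r in res) / n
    hi = max(abs(r) for r in res)
    while mean_rho(hi) > delta:   # mean_rho is nonincreasing in s
        hi *= 2.0
    lo = hi
    while mean_rho(lo) < delta:
        lo /= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expected_rho(c, steps=4000):
    """E rho(Z / c) for Z ~ N(0, 1): midpoint rule on [-c, c] plus the two tails."""
    z = NormalDist()
    width = 2.0 * c / steps
    inner = 0.0
    for k in range(steps):
        u = -c + (k + 0.5) * width
        inner += rho_bisquare(u / c) * z.pdf(u)
    return inner * width + 2.0 * z.cdf(-c)

def tau_scale(res, c0=1.56, c1=6.0, delta=0.5):
    """tau-scale: M-scale times a rho_1-based correction factor."""
    s = m_scale(res, c0, delta)
    if s == 0.0:
        return 0.0
    delta1 = expected_rho(c1)
    mean_rho1 = sum(rho_bisquare(r / (c1 * s)) for r in res) / len(res)
    return s * math.sqrt(mean_rho1 / delta1)

# Hypothetical residuals: 40 moderate values plus two gross outliers.
res = [0.3 * math.sin(i) for i in range(1, 41)] + [25.0, -30.0]
print(m_scale(res), tau_scale(res))
```

Increasing $c_1$ makes $\rho_1$ closer to a quadratic over the bulk of the residuals, which is the mechanism by which the efficiency is raised while the BP stays tied to $\rho_0$.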

Yohai and Zamar (1988) showed that $\tau$-estimators satisfy an M-estimating
equation (5.8), where the score function $\psi$ is a linear combination of $\rho_0'$ and $\rho_1'$
with coefficients depending on the data. From this observation it follows that $\hat{\boldsymbol\beta}$
has an asymptotically normal distribution. Its asymptotic efficiency at the normal
distribution can be adjusted to be arbitrarily close to 1 by tuning the function $\rho_1$,
just as in the case of MM-estimators. The BP of the $\tau$-estimators is the same as that
of an S-estimator based on $\rho_0$, and so by suitable choice of $\rho_0$ the estimator can
attain the maximum BP for regression estimators.

Efficient algorithms to compute $\tau$-regression estimators were studied by
Salibian-Barrera et al. (2008a). They showed that the strategy discussed in
Section 5.7.3 can be applied successfully to $\tau$-regression estimators. R code
implementing this algorithm is publicly available online at
https://github.com/msalibian/fast-tau.

Note that the value of the $\tau$-scale estimator corresponding to the regression
estimator, that is, the value $\tau(\mathbf{r}(\hat{\boldsymbol\beta}))$ in (5.28), is a robust and efficient residual scale estimator.
As such, these estimators can be used to build robust tests for linear hypotheses
(see Section 4.7) with good level and power properties. In particular, one can
construct ANOVA-type tests of the form (4.65) where the sums of squared residuals
are replaced by $\tau$-scale estimators. Salibian-Barrera et al. (2016) studied such tests
and used a robust bootstrap method (see Section 5.6.1) to estimate the corresponding
p-values. R code implementing these tests is publicly available at
https://github.com/msalibian/tau-tests.


5.5 MM-estimators 


Computing an M-estimator requires finding the absolute minimum of

$$L(\boldsymbol\beta) = \sum_{i=1}^{n} \rho\!\left(\frac{r_i(\boldsymbol\beta)}{\hat\sigma}\right). \tag{5.29}$$

When $\rho$ is bounded (and thus non-convex) this is an exceedingly difficult task, except
for the cases $p = 1$ or $2$, where a grid search would work. However, we shall see
that it suffices to find a “good” local minimum to achieve both a high BP and high
efficiency for a normal distribution. This local minimum will be obtained by starting
from a reliable starting point and applying the IRWLS algorithm of Section 4.5.2.
This starting point will also be used to compute the robust residual scale $\hat\sigma$ required
to define the M-estimator, and hence it is necessary that it can be computed without
requiring a previous scale.

The $L_1$ estimator does not require a scale, but we have already seen that it is not
a convenient estimator when $\mathbf{X}$ is random. Hence we need an initial estimator that is
robust toward any kind of outliers and that does not require a previously computed
scale. We choose an S-estimator.

The steps of the proposed procedure are thus: 


1. Compute an initial consistent estimator $\hat{\boldsymbol\beta}_0$ with high breakdown point but
possibly low normal efficiency.

2. Compute a robust scale $\hat\sigma$ of the residuals $r_i(\hat{\boldsymbol\beta}_0)$.

3. Find a solution $\hat{\boldsymbol\beta}$ of (5.7) using an iterative procedure starting at $\hat{\boldsymbol\beta}_0$.

We shall demonstrate that in this way we can obtain $\hat{\boldsymbol\beta}$ having both a high BP and
a prescribed high efficiency at the normal distribution.

Now we look at the details of the above steps. The robust initial estimator $\hat{\boldsymbol\beta}_0$
must be regression, scale and affine equivariant, which ensures that $\hat{\boldsymbol\beta}$ inherits the
same properties. We choose an S-estimator with bisquare scale. We shall use two
different functions, $\rho$ and $\rho_0$, and each of these must be a bounded $\rho$-function in the
sense of Definition 2.1 at the end of Section 2.3.4. The scale estimator $\hat\sigma$ must be an
M-scale estimator (2.49) given by

$$\frac{1}{n}\sum_{i=1}^{n} \rho_0\!\left(\frac{r_i}{\hat\sigma}\right) = 0.5. \tag{5.30}$$

By (3.23) the asymptotic BP of $\hat\sigma$ is 0.5. As was seen at the end of Section 2.5, we
can always find $c_0$ such that using $\rho_0(r/c_0)$ ensures that the asymptotic value of $\hat\sigma$
coincides with the standard deviation when the $u_i$ are normal. For the bisquare scale
given by (2.52) this value is $c_0 = 1.56$.

The key result is given by Yohai (1987), who called these estimators
MM-estimators. Recall that all local minima of $L(\boldsymbol\beta)$ satisfy (5.8). Let $\rho$ satisfy

$$\rho \le \rho_0. \tag{5.31}$$

Yohai (1987) shows that if $\hat{\boldsymbol\beta}$ is such that

$$L(\hat{\boldsymbol\beta}) \le L(\hat{\boldsymbol\beta}_0), \tag{5.32}$$

then $\hat{\boldsymbol\beta}$ is consistent. It can also be shown, in the same way as the similar result for
location in Section 3.2.3, that its BP is not less than that of $\hat{\boldsymbol\beta}_0$. If, furthermore, $\hat{\boldsymbol\beta}$ is
any solution of (5.8), then it has the same efficiency as the global minimum. Thus it
is not necessary to find the absolute minimum of (5.7) to ensure a high BP and high
efficiency.

The numerical computation of the estimator follows the approach in Section 4.5:
starting with $\hat{\boldsymbol\beta}_0$ we use the IRWLS algorithm to obtain a solution of (5.8). It is
shown in Section 9.1 that $L(\boldsymbol\beta)$ given in (5.29) decreases at each iteration, which
ensures (5.32).

It remains to choose $\rho$ in order to attain the desired normal efficiency, which is
$1/v$, where $v$ is the expression (5.15) computed at the standard normal. Let $\rho^*$ be any
bounded $\rho$-function; for instance the bisquare given by (2.38) with $k = 1$. Let

$$\rho_0(r) = \rho^*\!\left(\frac{r}{c_0}\right) \quad\text{and}\quad \rho(r) = \rho^*\!\left(\frac{r}{c_1}\right),$$

where $c_0$ is chosen for consistency of the scale at the normal. In particular, when $\rho^*$
is the bisquare function, $c_0 = 1.56$. In order that $\rho \le \rho_0$ we must have $c_1 \ge c_0$; the
larger $c_1$, the higher the efficiency at the normal distribution. The values of $c_1$ for
prescribed efficiencies are the values of $k$ in Table 2.4.
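
The three steps can be illustrated on the simplest possible model, a straight line through the origin, $y_i = \beta x_i + u_i$. The Python sketch below is a toy version under several assumptions that are not taken from the text: the initial estimator is only a crude approximation to an S-estimator (the exact-fit slope $y_i/x_i$ with the smallest bisquare M-scale), the value $c_1 = 3.44$ is assumed to be the bisquare constant for 85% efficiency, the data are artificial, and all function names are hypothetical.

```python
import math

def rho(t):
    """Bisquare rho-function scaled to have maximum 1."""
    return 1.0 if abs(t) >= 1.0 else 1.0 - (1.0 - t * t) ** 3

def weight(t):
    """Bisquare weight, proportional to psi(t) / t."""
    return 0.0 if abs(t) >= 1.0 else (1.0 - t * t) ** 2

def m_scale(res, c=1.56, delta=0.5):
    """Solve (1/n) sum rho(r_i / (c s)) = delta for s by bisection."""
    n = len(res)
    if sum(1 for r in res if r != 0.0) <= n * delta:
        return 0.0
    def mean_rho(s):
        return sum(rho(r / (c * s)) for r in res) / n
    hi = max(abs(r) for r in res)
    while mean_rho(hi) > delta:
        hi *= 2.0
    lo = hi
    while mean_rho(lo) < delta:
        lo /= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def loss(x, y, b, sigma, c1=3.44):
    """Objective L(b) of (5.29) with rho(r) = rho*(r / c1)."""
    return sum(rho((yi - b * xi) / (c1 * sigma)) for xi, yi in zip(x, y))

def mm_slope(x, y, c0=1.56, c1=3.44, iters=100):
    def resid(b):
        return [yi - b * xi for xi, yi in zip(x, y)]
    # Step 1: crude initial estimate, the exact-fit slope with smallest M-scale.
    candidates = [yi / xi for xi, yi in zip(x, y) if xi != 0.0]
    b0 = min(candidates, key=lambda b: m_scale(resid(b), c0))
    # Step 2: residual scale from the initial fit; it stays fixed afterwards.
    sigma = m_scale(resid(b0), c0)
    if sigma == 0.0:
        return b0, sigma, b0
    # Step 3: IRWLS iterations for the M-estimator, starting at b0.
    b = b0
    for _ in range(iters):
        w = [weight(r / (c1 * sigma)) for r in resid(b)]
        den = sum(wi * xi * xi for wi, xi in zip(w, x))
        if den == 0.0:
            break
        b_new = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / den
        converged = abs(b_new - b) < 1e-12
        b = b_new
        if converged:
            break
    return b0, sigma, b

# Artificial data: true slope 2, with the last four responses replaced by outliers.
x = [i / 2 for i in range(1, 21)]
y = [2 * xi + 0.3 * math.sin(i) for i, xi in enumerate(x, start=1)]
for i in range(16, 20):
    y[i] = -10.0

b0, sigma, b_mm = mm_slope(x, y)
b_ls = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print(b0, sigma, b_mm, b_ls)
```

In this example the LS slope is pulled far from 2 by the four high-leverage outliers, while the final iterate stays close to 2 and does not increase the objective relative to the starting value, in line with (5.32).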

In Section 5.8 we shall demonstrate the basic trade-off between normal efficiency
and bias under contamination: the larger the efficiency, the larger the bias. It is there- 
fore important to choose the efficiency so as to maintain reasonable bias control. The 
results in Section 5.8 show that an efficiency of 0.95 yields too high a bias, and hence 
it is safer to choose an efficiency of 0.85, which gives a smaller bias while retaining 
sufficiently high efficiency. 

Note that M-estimators, and MM-estimators in particular, have an unbounded IF
but a high BP. This seeming contradiction can be resolved by noting that an infinite
gross-error sensitivity means only that the maximum bias for $\varepsilon$-contamination,
$\mathrm{MB}(\varepsilon)$, is not $O(\varepsilon)$ for small $\varepsilon$, but does not imply that it is infinite. Actually, Yohai
and Zamar (1997) have shown that $\mathrm{MB}(\varepsilon) = O(\sqrt{\varepsilon})$ for the estimators considered
in this section. This implies that the bias induced by altering a single observation is
bounded by $c/\sqrt{n}$ for some constant $c$, instead of the stronger bound $c/n$.


Example 5.2 The next example is based on the “modified wood gravity data”. The 
raw data came from Draper and Smith (1966, p. 227) and were used to determine the 
influence of anatomical factors on wood specific gravity, with 20 cases, five explana- 
tory variables and an intercept. Rousseeuw and Leroy (1987) modified the data by 
replacing four observations (4, 6, 8, and 19) by outliers (dataset wood). The tables 
and figures for this example can be obtained with script wood.R. 


Figures 5.8 and 5.9 show the plot of the residuals against fit and the normal
Q-Q plot for the LS estimator. No outliers are apparent. Figures 5.10 and 5.11 are
the respective plots for the 85% normal efficiency MM-estimator, clearly showing
the four outliers 4, 6, 8 and 19. Figure 5.12 plots the ordered absolute residuals
from LS as the abscissa and those from the MM-estimator as the ordinate, as compared
to the identity line; the observations with the four largest absolute residuals
from the MM-estimator were omitted for reasons of scale. The plot shows that the
MM-residuals are in general smaller than the LS residuals, and hence MM gives a
better fit to the bulk of the data.

Figure 5.8 Wood gravity data: LS residuals versus fit

Figure 5.10 Wood gravity data: MM residuals versus fit

Figure 5.11 Wood gravity data: Q-Q plot of robust residuals

Figure 5.12 Wood gravity data: ordered absolute residuals from MM and from LS
(largest residuals omitted)


5.6 Robust inference and variable selection for M-estimators


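In general, estimators that fulfill an M-estimating equation like (5.8) are
asymptotically normal, and hence approximate confidence intervals and tests can be
obtained as in Sections 4.4.2 and 4.7.2. Recall that, according to (5.14), $\hat{\boldsymbol\beta}$ has an
approximately normal distribution, with covariance matrix given by $v n^{-1}\mathbf{V}_x^{-1}$. For the
purposes of inference, $\mathbf{V}_x$ and $v$ can be estimated by

$$\hat{\mathbf{V}}_x = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i' = \frac{1}{n}\mathbf{X}'\mathbf{X}, \qquad \hat v = \hat\sigma^2\,\frac{\mathrm{ave}_i\{\psi(r_i/\hat\sigma)^2\}}{[\mathrm{ave}_i\{\psi'(r_i/\hat\sigma)\}]^2}\,\frac{n}{n-p}, \tag{5.33}$$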

and hence the resulting confidence intervals and tests are the same as those for fixed X. 

Actually, this estimator of $\mathbf{V}_x$ has the drawback of not being robust. In fact, just
one large $\mathbf{x}_i$ corresponding to an outlying observation with a large residual may have
a large distorting influence on $\mathbf{X}'\mathbf{X}$, with diagonal elements typically inflated. Since $\hat v$
is stable with respect to outlier influence, the confidence intervals based on $\hat v(\mathbf{X}'\mathbf{X})^{-1}$
may be too small and hence the coverage probabilities may be much smaller than the
nominal.

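Yohai et al. (1991) proposed a more robust estimator of the matrix $\mathbf{V}_x$, defined as

$$\tilde{\mathbf{V}}_x = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i\,\mathbf{x}_i\mathbf{x}_i', \tag{5.34}$$

with $w_i = W(r_i/\hat\sigma)$, where $W$ is the weight function (2.31). Under the model with $n$
large, the residual $r_i$ is close to the error $u_i$, and since $u_i$ is independent of $\mathbf{x}_i$ we have,
as $n \to \infty$,

$$\frac{1}{n}\sum_{i=1}^{n} w_i\,\mathbf{x}_i\mathbf{x}_i' \to_p \mathrm{E}W\!\left(\frac{u}{\sigma}\right)\mathrm{E}\mathbf{x}\mathbf{x}'$$

and

$$\frac{1}{n}\sum_{i=1}^{n} w_i \to_p \mathrm{E}W\!\left(\frac{u}{\sigma}\right),$$

and thus

$$\tilde{\mathbf{V}}_x \to_p \mathrm{E}\mathbf{x}\mathbf{x}'.$$
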
Assume that $\psi(t) = 0$ if $|t| > k$ for some $k$, as happens with the bisquare. Then, if
$|r_i|/\hat\sigma > k$, the weight $w_i$ is zero. If observation $i$ has high leverage (i.e., $\mathbf{x}_i$ is “large”)
and is an outlier, then, since the estimator is robust, $|r_i|/\hat\sigma$ is also “large”, and this
observation will have null weight and hence will not influence $\tilde{\mathbf{V}}_x$. On the other hand,
if $\mathbf{x}_i$ is “large” but $r_i$ is small or moderate, then $w_i$ will be nonnull, and $\mathbf{x}_i$ will still
have a beneficial influence on $\tilde{\mathbf{V}}_x$ by virtue of reducing the variance of $\hat{\boldsymbol\beta}$. Hence
the advantage of $\tilde{\mathbf{V}}_x$ is that it downweights high-leverage observations only when
they are outlying. Therefore, we recommend the routine use of $\tilde{\mathbf{V}}_x$ instead of $\hat{\mathbf{V}}_x$ for
all instances of inference, in particular the Wald-type tests defined in Section 4.7.2,
which also require estimating the covariance matrix of the estimators.

The following example shows how different the inference can be when using an 
MM-estimator rather than the LS estimator. 

For the straight-line regression of the mineral data in Example 5.1, the slope 
given by LS and its estimated SD are 0.135 and 0.020 respectively, while the cor- 
responding values for the MM-estimator are 0.044 and 0.021; hence the classical 
and robust two-sided intervals with level 0.95 are (0.0958, 0.1742) and (0.00284, 
0.08516), which are disjoint, showing the influence of the outlier. 
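
The two intervals quoted above follow from the usual normal approximation, estimate $\pm\, z_{0.975} \times$ SD. As a quick check, the short Python sketch below recomputes them from the reported slopes and SDs; the function name is hypothetical, and the normal quantile is taken from the standard library rather than rounded to 1.96.

```python
from statistics import NormalDist

def wald_interval(estimate, sd, level=0.95):
    """Two-sided normal-approximation interval: estimate +/- z * sd."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * sd, estimate + z * sd

# Slopes and estimated SDs reported in the text for the mineral data.
ls_ci = wald_interval(0.135, 0.020)
mm_ci = wald_interval(0.044, 0.021)

# The intervals are disjoint when the upper MM limit lies below the lower LS limit.
disjoint = mm_ci[1] < ls_ci[0]
print(ls_ci, mm_ci, disjoint)
```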


5.6.1 Bootstrap robust confidence intervals and tests 


Since the confidence intervals and tests for robust estimators are asymptotic, their 
actual level may be lower than the desired one if n is not large. This occurs espe- 
cially when the error distribution is very heavy tailed or asymmetric (see the end of 
Section 10.3). Better results can be obtained using the bootstrap method, which sim- 
ulates the distribution of the estimator of interest by recomputing the estimator on 
a large number of new samples randomly drawn from the original data (bootstrap 
samples). It can be shown that, under certain regularity conditions, the empirical dis- 
tribution of the recomputed estimator converges to the sampling distribution of the 
estimator of interest. See, for example, Efron and Tibshirani (1993) and Davison and 
Hinkley (1997) for more details. 

While the bootstrap approach has proved successful in many situations, its appli- 
cation to robust estimators presents two main problems. One is that recomputing 
robust estimators on a large number of bootstrap samples may demand impractical 
computing times. Another difficulty is that the proportion of outliers in some of the 
bootstrap samples might be much higher than in the original one, severely affecting 
the recomputed estimator and the resulting estimated distribution. Salibian-Barrera 
and Zamar (2002) proposed a bootstrap method that is faster and more robust than 
the naive application of the bootstrap approach. The main idea is to express the robust 
estimator $\hat{\boldsymbol\beta}$ as the solution to a fixed-point equation

$$\hat{\boldsymbol\beta} = g_n(\hat{\boldsymbol\beta}),$$

where the function $g_n : \mathbb{R}^p \to \mathbb{R}^p$ generally depends on the sample. Then, given a
bootstrap sample, instead of solving the above equation for the bootstrap sample, we
compute a one-step approximation:

$$\hat{\boldsymbol\beta}^{1*} = g_n^*(\hat{\boldsymbol\beta}),$$

where $g_n^*$ is the function corresponding to the bootstrap sample, but evaluated at the
estimator computed on the original sample. It is easy to see that the distribution of
$\hat{\boldsymbol\beta}^{1*}$ generally will not estimate that of $\hat{\boldsymbol\beta}$. However, it can be corrected by applying
a simple linear transformation. A Taylor expansion of the fixed-point equation above
shows that, under certain regularity conditions, the following approximation has the
same limiting distribution as the fully bootstrapped estimator:

$$\hat{\boldsymbol\beta} + \left[\mathbf{I} - \nabla g_n(\hat{\boldsymbol\beta})\right]^{-1}\left(\hat{\boldsymbol\beta}^{1*} - \hat{\boldsymbol\beta}\right), \tag{5.35}$$

where $\mathbf{I}$ denotes the identity matrix, and $\nabla g_n$ is the matrix of first derivatives of $g_n$.
Thus, as usual, we can estimate the distribution of (5.35) by evaluating it on many 
bootstrap samples. Formal consistency proofs for this approach under mild regularity 
conditions can be found in Salibian-Barrera and Zamar (2002) and Salibian-Barrera 
et al. (2006). 
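
The linear correction can be motivated by a first-order expansion. The following is a heuristic sketch only, not the formal argument of the cited papers. Here $\hat{\beta}^{*}$ denotes the fully bootstrapped estimator, that is, the solution of the fixed-point equation for the bootstrap sample, and $\hat{\beta}^{1*}$ is the one-step value obtained by evaluating the bootstrap-sample function at the original estimator.

```latex
% Heuristic first-order expansion of the bootstrap fixed-point equation
% around the original estimator \hat\beta.
\begin{align*}
\hat{\beta}^{*} = g_n^{*}(\hat{\beta}^{*})
  &\approx g_n^{*}(\hat{\beta}) + \nabla g_n^{*}(\hat{\beta})\,(\hat{\beta}^{*} - \hat{\beta})
   = \hat{\beta}^{1*} + \nabla g_n^{*}(\hat{\beta})\,(\hat{\beta}^{*} - \hat{\beta}), \\
\bigl[I - \nabla g_n^{*}(\hat{\beta})\bigr]\,(\hat{\beta}^{*} - \hat{\beta})
  &\approx \hat{\beta}^{1*} - \hat{\beta}, \\
\hat{\beta}^{*}
  &\approx \hat{\beta} + \bigl[I - \nabla g_n^{*}(\hat{\beta})\bigr]^{-1}(\hat{\beta}^{1*} - \hat{\beta}).
\end{align*}
```

Replacing $\nabla g_n^{*}$ by $\nabla g_n$, on the assumption that both approach the same limit, gives an expression of the form (5.35). The matrix $[I - \nabla g_n(\hat{\beta})]^{-1}$ then needs to be computed only once, on the original sample.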

This fast and robust way to estimate the sampling distribution of robust estimators
can be used to derive confidence intervals and tests of hypotheses with good
robustness properties (see, for example, Salibian-Barrera, 2005; Van Aelst and Willems,
2011; and Salibian-Barrera et al., 2016). It has also been shown to provide a robust
alternative to bootstrap-based model selection procedures (Salibian-Barrera and Van
Aelst, 2008).

A review of the method with applications to different models and settings can be
found in Salibian-Barrera et al. (2008b). Code implementing this method for different
robust estimators is available at https://github.com/msalibian and
http://users.ugent.be/~svaelst/software/.


5.6.2 Variable selection 


In many situations, the main purpose of fitting a regression equation is to predict 
the response variable. If the number of predictor variables is large and the number 
of observations relatively small, fitting the model using all the predictors will yield 
poorly estimated coefficients, especially when predictors are highly correlated. More 
precisely, the variances of the estimated coefficients will be high and therefore the 
forecasts made with the estimated model will have a large variance too. A common 
practice to overcome this difficulty is to fit a model using only a subset of variables, 
selected according to some statistical criterion. 

Consider evaluating a model using the mean squared error (MSE) of the forecast. 
This MSE is composed of the variance plus the squared bias. Deleting some predic- 
tors may cause an increase in the bias and a reduction of the variance. Hence the 
problem of finding the best subset of predictors can be viewed as that of finding the 
best trade-off between bias and variance. There is a very large literature on the subset 
selection problem when the LS estimator is used as an estimation procedure; see for
example Miller (1990), Seber (1984) and Hastie et al. (2001).

Let the sample be $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})$. The predictors
are assumed to be random, but the case of fixed predictors is treated in a similar
manner. For each set $C \subset \{1, 2, \ldots, p\}$, let $q = \#(C)$ and $\mathbf{x}_{iC} = (x_{ij})_{j \in C} \in \mathbb{R}^q$. Akaike's
(1970) Final Prediction Error (FPE) criterion based on the LS estimator is defined as

$$\mathrm{FPE}(C) = \mathrm{E}\left(y_0 - \mathbf{x}_{0C}'\hat{\boldsymbol\beta}_C\right)^2, \tag{5.36}$$


where $\hat{\boldsymbol\beta}_C$ is the estimator based on the set $C$ and $(\mathbf{x}_0, y_0)$ have the same joint
distribution as $(\mathbf{x}_i, y_i)$ and are independent of the sample. The expectation on the right-hand
side of (5.36) is with respect to both $(\mathbf{x}_0, y_0)$ and $\hat{\boldsymbol\beta}_C$. It can then be shown that an
approximately unbiased estimator of FPE is

$$\mathrm{FPE}^*(C) = \frac{1}{n}\sum_{i=1}^{n} r_{iC}^2\left(1 + \frac{2q}{n}\right), \tag{5.37}$$

where

$$r_{iC} = y_i - \mathbf{x}_{iC}'\hat{\boldsymbol\beta}_C.$$

The first term of (5.37) evaluates the goodness of the fit when the estimator is $\hat{\boldsymbol\beta}_C$,
and the second term penalizes the use of a large number of explanatory variables. The
best subset $C$ is chosen as the one minimizing $\mathrm{FPE}^*(C)$.

It is clear, however, that a few outliers may distort the value of $\mathrm{FPE}^*(C)$, so that
the choice of the predictors may be determined by a few atypical observations. To
robustify FPE, we must note that not only must the regression estimator be robust,
but the value of the criterion should not be sensitive to a few residuals.

We shall therefore robustify the FPE criterion by using for $\hat{\boldsymbol\beta}$ a robust M-estimator
(5.7) along with a robust error scale estimator $\hat\sigma$. In addition, we shall bound the
influence of large residuals by replacing the square in (5.36) with a bounded $\rho$-function,
namely the same $\rho$ as in (5.7). To make the procedure invariant under scale changes,
the error must be divided by a scale $\sigma$, and to make consistent comparisons among
different subsets of the predictor variables, $\sigma$ must remain the same for all $C$. Thus the
proposed criterion, which will be called the robust final prediction error (RFPE), is
defined as

$$\mathrm{RFPE}(C) = \mathrm{E}\rho\!\left(\frac{y_0 - \mathbf{x}_{0C}'\hat{\boldsymbol\beta}_C}{\sigma}\right), \tag{5.38}$$

where $\sigma$ is the asymptotic value of $\hat\sigma$.
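To estimate RFPE for each subset $C$, we first compute

$$\hat{\boldsymbol\beta}_C = \arg\min_{\boldsymbol\beta \in \mathbb{R}^q}\sum_{i=1}^{n}\rho\!\left(\frac{y_i - \mathbf{x}_{iC}'\boldsymbol\beta}{\hat\sigma}\right),$$

where the scale estimator $\hat\sigma$ is based on the full set of variables, and define the
estimator by

$$\mathrm{RFPE}^*(C) = \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{r_{iC}}{\hat\sigma}\right) + \frac{q}{n}\,\frac{\hat A}{\hat B}, \tag{5.39}$$

where

$$\hat A = \frac{1}{n}\sum_{i=1}^{n}\psi\!\left(\frac{r_{iC}}{\hat\sigma}\right)^2, \qquad \hat B = \frac{1}{n}\sum_{i=1}^{n}\psi'\!\left(\frac{r_{iC}}{\hat\sigma}\right). \tag{5.40}$$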


Note that if $\rho(r) = r^2$, then $\psi(r) = 2r$, and the result is equivalent to (5.37) since $\hat\sigma$
cancels out. The criterion (5.39) is justified in Section 5.13.7.
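
The estimated criterion is easy to evaluate once the residuals of the fit on subset $C$ and the full-model scale are available. The Python sketch below assumes the form: average of $\rho(r_{iC}/\hat\sigma)$ plus $(q/n)\hat A/\hat B$, with $\hat A$ the average of $\psi^2$ and $\hat B$ the average of $\psi'$. The function names, the bisquare tuning constant 3.44 and the numerical values are illustrative assumptions. The last lines check the remark above that $\rho(r) = r^2$ leads back to the classical criterion up to the factor $\hat\sigma^2$.

```python
def bisquare(c):
    """Return (rho, psi, dpsi) for the bisquare with tuning constant c (rho has maximum 1)."""
    def rho(t):
        u = (t / c) ** 2
        return 1.0 if u >= 1.0 else 1.0 - (1.0 - u) ** 3
    def psi(t):  # derivative of rho
        u = (t / c) ** 2
        return 0.0 if u >= 1.0 else 6.0 * t / c ** 2 * (1.0 - u) ** 2
    def dpsi(t):  # derivative of psi
        u = (t / c) ** 2
        return 0.0 if u >= 1.0 else 6.0 / c ** 2 * (1.0 - u) * (1.0 - 5.0 * u)
    return rho, psi, dpsi

def rfpe_star(residuals, sigma, q, rho, psi, dpsi):
    """Goodness-of-fit term plus the penalty (q / n) * A / B."""
    n = len(residuals)
    t = [r / sigma for r in residuals]
    fit = sum(rho(ti) for ti in t) / n
    a_hat = sum(psi(ti) ** 2 for ti in t) / n
    b_hat = sum(dpsi(ti) for ti in t) / n
    return fit + (q / n) * a_hat / b_hat

# Hypothetical residuals from a fit with q = 2 predictors and scale 1.3.
residuals = [0.5, -1.0, 0.2, 1.5, -0.7, 0.1]
sigma, q = 1.3, 2
n = len(residuals)

rho, psi, dpsi = bisquare(3.44)
robust_value = rfpe_star(residuals, sigma, q, rho, psi, dpsi)

# With rho(r) = r^2 the criterion reduces to the classical one divided by sigma^2.
quadratic_value = rfpe_star(residuals, sigma, q,
                            lambda t: t * t, lambda t: 2.0 * t, lambda t: 2.0)
classical_fpe = sum(r * r for r in residuals) / n * (1.0 + 2.0 * q / n)
print(robust_value, quadratic_value * sigma ** 2, classical_fpe)
```

Note that $\hat B$ can be small or even negative when many scaled residuals fall where $\psi'$ is negative, so a practical implementation would need to guard against that case.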

When $p$ is large, finding the optimal subset may be very costly in terms of
computation time and therefore strategies to find suboptimal sets can be used. Two problems
arise:


• Searching over all subsets may be impractical because of the extremely large
number of subsets.

• Each computation of RFPE* requires recomputing a robust estimator $\hat{\boldsymbol\beta}_C$ for each
$C$, which can be very time-consuming when performed a large number of times.


In the case of the LS estimator, there exist very efficient algorithms to compute
the classical FPE (5.37) for all subsets (see the references above), and so the first
problem above is tractable for the classical approach if $p$ is not too large. But
computing a robust estimator $\hat{\boldsymbol\beta}_C$ for all subsets $C$ would be infeasible unless $p$ were
small. A simple but frequently effective suboptimal strategy is stepwise regression:
add or remove one variable at a time (“forward” or “backward” regression), choosing
the one whose inclusion or deletion yields the lowest value of the criterion. Various
simulation studies indicate that the backward procedure is better. Starting with
$C = \{1, \ldots, p\}$, we remove one variable at a time. At step $k$ ($= 1, \ldots, p - 1$), we
have a subset $C$ with $\#(C) = p - k + 1$, and the next predictor to be deleted is found
by searching over all subsets of $C$ of size $p - k$ to find the one with smallest RFPE*.

The second problem above arises because robust estimators are computationally
intensive, the more so when there is a large number of predictors. A simple way
to reduce the computational burden is to avoid repeating the subsampling for each
subset $C$ by computing $\hat{\boldsymbol\beta}_C$ starting from the approximation given by the weighted
LS estimator with weights $w_i$ obtained from the estimator corresponding to the full
model.


Example 5.3 To demonstrate the advantages of using a robust model selection
approach based on RFPE*, we shall use a simulated dataset from a known model
for which the “correct solution” is clear. The results in this example are obtained
with script step.R.


We generated $n = 50$ observations from the model $y_i = \beta_0 + \mathbf{x}_i'\boldsymbol\beta_1 + u_i$ with
$\beta_0 = 1$ and $\boldsymbol\beta_1' = (1, 1, 1, 0, 0, 0)$, so that $p = 7$. The $u_i$ and $x_{ij}$ are i.i.d. standard
normal. Here, a perfect model selection method would select the variables {1, 2, 3}.
We replaced the first six observations by outliers, changing the values of $y$ to
$25 + 5i$ and those of $x_4, x_5, x_6$ to $i/2$, $1 \le i \le 6$. We then applied the backward
stepwise procedure using both the RFPE* criterion based on an MM-estimator and
the Akaike information criterion (AIC) based on the LS estimator. While the RFPE*
criterion gives the correct answer, selecting variables 1, 2 and 3, the AIC criterion
selects variables 2, 4, 5 and 6. Both selection processes are shown in Table 5.2.

Table 5.2 Variable selection for simulated data

        LS                  Robust
Vars.     AIC        Vars.     RFPE*
123456    240.14     123456    6.73
13456     238.16     12356     6.54
2456      237.33     1235      6.35
456       237.83     123       6.22
                     23        7.69

Other approaches to robust model selection were given by Qian and Künsch
(1998), Ronchetti and Staudte (1994) and Ronchetti et al. (1997).


5.7 Algorithms 


In this section, we discuss successful strategies to compute the robust regression
estimators discussed above. Both classes of estimators presented in this chapter
(redescending M-estimators, and those based on minimizing a robust residual scale
estimator) can be challenging to calculate because they are defined as the minimum
of non-convex objective functions with many variables. In addition, some of these
functions are not differentiable. As an illustration, consider an estimator based on a
robust scale, as in (5.17), for a simple linear regression model through the origin.
We simulated $n = 50$ observations from the model $y_i = \beta x_i + u_i$, where $x_i$ and $u_i$ are
i.i.d. N(0, 1). The true $\beta = 0$, and we added three outliers located at $(x, y) = (10, 20)$.
Figure 5.13 shows the loss functions to be minimized when computing the LMS, the
LTS estimator with $\alpha = 0.5$, an S-estimator with a bisquare $\rho$-function, and the LS
estimator. The corresponding figure for an MM-estimator with a bisquare $\rho$-function
is very similar to that of the S-estimator. We see that the LMS loss function is very
jagged, and that all estimators except LS exhibit a local minimum at about 2, which
is a “bad solution”. The global minima of the loss functions for these four estimators
are attained at the values of 0.06, 0.14, 0.07 and 1.72, respectively.

Figure 5.13 Loss functions for different regression estimators


The loss functions for the LMS and LTS estimators are not differentiable, and
hence gradient methods cannot be applied to them. Stromberg (1993a,b) gives an
exact algorithm for computing the LMS estimator, but the number of operations it
requires is of order $\binom{n}{p+1}$, which is only practical for very small values of $n$ and $p$.
For other approaches see Agulló (1997, 2001) and Hawkins (1994). The loss function
for the bisquare S-estimator is differentiable, but since gradient methods ensure only
the attainment of a local minimum, a “good” starting point is needed.

In Section 5.7.1 we present iterative algorithms that can be shown to improve
the corresponding objective function at each step. We will refer to these as “local
improvements”. The lack of convexity, however, means that usually there are several
local minima, and hence these algorithms may not converge to the global minimum
of the objective function. A strategy to solve this problem is to start the iterations
from a large number of different initial points. Rather than using starting values taken
completely at random from the set of possible parameter values, it is generally more
efficient to let the data guide the construction of initial values. Below we discuss two
such methods: subsampling, and an estimator proposed by Peña and Yohai (1999).
The first is widely used and is discussed in Section 5.7.2. Since this method requires
using a large number of starting points, one can reduce the overall computing time
by identifying, early in the local improvement or “concentration” iterations, which
subsampling candidates are most promising (see Rousseeuw and van Driessen, 2000;
and Salibian-Barrera and Yohai, 2006). This variant of the subsampling algorithm is
presented in Section 5.7.3. The use of Peña-Yohai starting points has only recently
started to receive attention in the literature, but we show in Section 5.7.4 below that
it compares very favorably to subsampling.


5.7.1 Finding local minima 


The iterative reweighted least squares algorithm discussed in Section 4.5.2 can be 
used as a local improvement strategy for M-estimators defined in (5.7). In Section 9.1 
we show that the objective function decreases in each iteration, and thus this method 
leads to a local minimum of (5.7). In what follows, we describe iterative procedures 
for S- and LTS-estimators that also lead to local minima. These work by decreasing 
the value of the objective function at each step. 


5.7.1.1 Local improvements for S-estimators 


Recall that S-estimators can be thought of as M-estimators (see p. 127), and thus a
simple approach to find a local minimum that satisfies the first-order conditions (5.8)
is to adapt the IRWLS algorithm described in Section 4.5.2 by updating $\hat\sigma$ at each
step. In other words, if $\hat{\boldsymbol\beta}_k$ is the estimator at the $k$th iteration, the scale estimator $\hat\sigma_k$
is obtained by solving

$$\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{r_i(\hat{\boldsymbol\beta}_k)}{\hat\sigma_k}\right) = \delta \tag{5.41}$$

with the method of Section 2.8.2. Then $\hat{\boldsymbol\beta}_{k+1}$ is obtained by weighted least squares,
with weights $w_i = W(r_i/\hat\sigma_k)$. It can be shown that if $W$ is decreasing then $\hat\sigma_k$ decreases
at each step (see Salibian-Barrera and Yohai (2006) and Section 9.2). Since computing
$\hat\sigma$ consumes a major proportion of the computation time, it is important to do
it economically, and in this regard one can, for example, start the iterations for $\hat\sigma_k$
at the previous value $\hat\sigma_{k-1}$. Other strategies are discussed in Salibian-Barrera and
Yohai (2006).
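
The alternation between a scale update and a weighted LS step can be sketched in a few lines of Python for a line through the origin. This is only an illustration under assumptions not taken from the text: the $\rho$-function is a bisquare with $c = 1.56$ and $\delta = 0.5$, the M-scale is solved by plain bisection rather than by the method of Section 2.8.2, the data are artificial, and all names are hypothetical.

```python
import math

def rho(t):
    """Bisquare rho-function scaled to have maximum 1."""
    return 1.0 if abs(t) >= 1.0 else 1.0 - (1.0 - t * t) ** 3

def weight(t):
    """Bisquare weight, proportional to psi(t) / t and decreasing in |t|."""
    return 0.0 if abs(t) >= 1.0 else (1.0 - t * t) ** 2

def m_scale(res, c=1.56, delta=0.5):
    """Solve (1/n) sum rho(r_i / (c s)) = delta for s by bisection."""
    n = len(res)
    if sum(1 for r in res if r != 0.0) <= n * delta:
        return 0.0
    def mean_rho(s):
        return sum(rho(r / (c * s)) for r in res) / n
    hi = max(abs(r) for r in res)
    while mean_rho(hi) > delta:
        hi *= 2.0
    lo = hi
    while mean_rho(lo) < delta:
        lo /= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def s_local_improvement(x, y, b, c=1.56, delta=0.5, steps=30):
    """Alternate scale updates and weighted LS steps; return the slope and the scale path."""
    scales = []
    for _ in range(steps):
        res = [yi - b * xi for xi, yi in zip(x, y)]
        sigma = m_scale(res, c, delta)      # scale of the current residuals
        scales.append(sigma)
        if sigma == 0.0:
            break
        w = [weight(r / (c * sigma)) for r in res]
        den = sum(wi * xi * xi for wi, xi in zip(w, x))
        if den == 0.0:
            break
        b = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / den  # weighted LS step
    return b, scales

# Artificial data: true slope 2, with the last four responses replaced by outliers.
x = [i / 2 for i in range(1, 21)]
y = [2 * xi + 0.3 * math.sin(i) for i, xi in enumerate(x, start=1)]
for i in range(16, 20):
    y[i] = -10.0

b_hat, scales = s_local_improvement(x, y, b=1.5)
print(b_hat, scales[0], scales[-1])
```

The recorded scale path is nonincreasing, as the monotonicity result quoted above predicts for a decreasing weight function.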


5.7.1.2 Concentration steps for the LTS estimator 


A local minimum of (5.25) can be attained iteratively using the “concentration step”
(C-step) of Rousseeuw and van Driessen (2000). Given a candidate $\hat{\boldsymbol\beta}_1$, let $\hat{\boldsymbol\beta}_2$ be the
LS estimator based on the data corresponding to the $h$ smallest absolute residuals. It
is proved in Section 9.3 that the trimmed L-scale $\hat\sigma$ given by (5.25) is not larger for
$\hat{\boldsymbol\beta}_2$ than for $\hat{\boldsymbol\beta}_1$. This procedure is exact, in the sense that after a finite number of steps
it attains a value of $\hat{\boldsymbol\beta}$ such that further steps do not decrease the value of $\hat\sigma$. It can
be shown that this $\hat{\boldsymbol\beta}$ is a local minimum, but not necessarily a global one.
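
A C-step is short enough to write out in full. The Python sketch below does so for a line through the origin, where the LS fit on the retained points has a closed form. The data, the starting value and the function names are hypothetical, and the sum of the $h$ smallest squared residuals is used as the objective, which is equivalent to minimizing the trimmed scale.

```python
import math

def lts_objective(x, y, b, h):
    """Sum of the h smallest squared residuals for slope b."""
    squared = sorted((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    return sum(squared[:h])

def c_step(x, y, b, h):
    """One concentration step: LS fit on the h points with smallest |residual|."""
    order = sorted(range(len(x)), key=lambda i: abs(y[i] - b * x[i]))
    keep = order[:h]
    sxx = sum(x[i] * x[i] for i in keep)
    sxy = sum(x[i] * y[i] for i in keep)
    return sxy / sxx  # assumes the retained x values are not all zero

def concentrate(x, y, b, h, max_steps=100):
    """Repeat C-steps until the objective stops decreasing."""
    obj = lts_objective(x, y, b, h)
    for _ in range(max_steps):
        b_new = c_step(x, y, b, h)
        obj_new = lts_objective(x, y, b_new, h)
        if obj_new >= obj:
            break
        b, obj = b_new, obj_new
    return b, obj

# Artificial data: true slope 2, with the last four responses replaced by outliers.
x = [i / 2 for i in range(1, 21)]
y = [2 * xi + 0.3 * math.sin(i) for i, xi in enumerate(x, start=1)]
for i in range(16, 20):
    y[i] = -10.0

h = (len(x) + 1 + 1) // 2          # h = floor((n + p + 1) / 2) with p = 1 (assumed choice)
b_hat, obj = concentrate(x, y, b=0.5, h=h)
print(b_hat, obj)
```

Since the result is only a local minimum, the outcome depends on the starting value, which is why C-steps are combined with many starting points in practice.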


5.7.2 Starting values: the subsampling algorithm


The idea behind subsampling is to construct random candidate solutions to (5.17)
using the sample points. More specifically, we construct candidates by fitting the
model to randomly chosen subsets of the data. Since, intuitively, a good candidate
should fit the clean portion of the sample well, we need to find well-conditioned
subsets of non-outlying points in order to obtain good starting points.

To compute the candidate solutions, we take subsamples $J$ of size $p$ from the data:

$$\{(\mathbf{x}_i, y_i) : i \in J\}, \quad J \subset \{1, \ldots, n\}, \quad \#(J) = p.$$

For each set $J$ we find the vector $\hat{\boldsymbol\beta}_J$ that satisfies the exact fit $\mathbf{x}_i'\hat{\boldsymbol\beta}_J = y_i$ for $i \in J$. If a
subsample is collinear, it is discarded and replaced by another. Since considering all
$\binom{n}{p}$ subsamples would be prohibitive unless both $n$ and $p$ are rather small, we choose
$N$ of them at random: $\{J_k : k = 1, \ldots, N\}$. The initial value for the improvement steps
is then taken to be the candidate that produces the best value of the objective function:
$\hat{\boldsymbol\beta}_{J_{k_0}}$, where

$$k_0 = \arg\min_{1 \le k \le N} \hat\sigma\left(\mathbf{r}(\hat{\boldsymbol\beta}_{J_k})\right). \tag{5.42}$$

We can now apply the algorithms in Section 5.7.1 to $\hat{\boldsymbol\beta}_{J_{k_0}}$ to obtain a local minimum,
which is taken as our estimator $\hat{\boldsymbol\beta}$.

Note that an alternative procedure would be to apply the local improvement
iterations to every candidate $\hat{\boldsymbol\beta}_{J_k}$, obtaining their corresponding local minimizers $\hat{\boldsymbol\beta}^*_{J_k}$,
$k = 1, \ldots, N$, and then select as our estimator $\hat{\boldsymbol\beta}$ the one with the best objective
function. Although the resulting estimator will certainly not be worse (and will
probably be better) than the previous one, its computational cost will be much higher. In
Section 5.7.3 we discuss a better intermediate option.

Since the motivation for the subsampling method is to construct a good initial
candidate by fitting a clean subsample of the data, one may need to take a large number
$N$ of random subsamples in order to find a clean one with high probability.
Specifically, suppose the sample contains a proportion $\varepsilon$ of outliers. The probability of an
outlier-free subsample is $\alpha = (1 - \varepsilon)^p$, and the probability of at least one “good”
subsample among the $N$ drawn is $1 - (1 - \alpha)^N$. If we want this probability to be larger than $1 - \gamma$, we must
have

$$\ln\gamma \ge N\ln(1 - \alpha) \approx -N\alpha$$

and hence

$$N \ge \frac{|\ln\gamma|}{|\ln(1 - (1 - \varepsilon)^p)|} \approx \frac{|\ln\gamma|}{(1 - \varepsilon)^p} \tag{5.43}$$

for $p$ not too small. Therefore $N$ must grow approximately exponentially with $p$.
Table 5.3 gives the minimum $N$ for $\gamma = 0.01$. Since the number of “good”
(outlier-free) subsamples is binomial, the expected number of good samples is $N\alpha$,
and so for $\gamma = 0.01$ the expected number of “good” subsamples is $|\ln 0.01| = 4.6$.
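
The bound on the number of subsamples is a one-line computation. The following Python sketch evaluates the exact inequality $N \ge |\ln\gamma| / |\ln(1 - (1-\varepsilon)^p)|$, rounded up to an integer; the function name is hypothetical. The values it returns depend on the rounding and on whether the exact form or the approximation is used, so they need not coincide with tabulated values.

```python
import math

def min_subsamples(p, eps, gamma=0.01):
    """Smallest N with P(at least one outlier-free subsample of size p) >= 1 - gamma."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    alpha = (1.0 - eps) ** p                 # probability that one subsample is clean
    # log1p(-alpha) is an accurate form of ln(1 - alpha) when alpha is tiny.
    return math.ceil(math.log(gamma) / math.log1p(-alpha))

for p in (5, 10, 20):
    print(p, [min_subsamples(p, eps) for eps in (0.1, 0.25, 0.5)])
```

The near-exponential growth in $p$ is visible directly: for a fixed $\varepsilon$, each additional predictor multiplies the required $N$ by roughly $1/(1-\varepsilon)$.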

Table 5.3 Minimum N for γ = 0.01

 p    ε = 0.10     0.15      0.20          0.25           0.50
 5         4          5         8            12            101
10         8         15        28            56           3295
15        14         35        90           239         105475
20        25         81       278          1013    3.38 × 10^6
30        74        420      2599         18023    3.46 × 10^9
40       216       2141     24214        320075    3.54 × 10^12
50       623      10882    225529   5.68 × 10^6    3.62 × 10^15


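When this method is applied to an S-estimator, the following observation saves
much computing time. Suppose we have examined $M - 1$ subsamples and $\hat\sigma_{M-1}$ is
the current minimum. Now we draw the $M$th subsample, which yields the candidate
estimator $\hat{\boldsymbol\beta}_M$. We may avoid the effort of computing the new scale estimator $\hat\sigma_M$ in
those cases where it will turn out not to be smaller than $\hat\sigma_{M-1}$. The reason is as follows.
If $\hat\sigma_M < \hat\sigma_{M-1}$, then since $\rho$ is monotonic

$$n\delta = \sum_{i=1}^{n}\rho\!\left(\frac{r_i(\hat{\boldsymbol\beta}_M)}{\hat\sigma_M}\right) \ge \sum_{i=1}^{n}\rho\!\left(\frac{r_i(\hat{\boldsymbol\beta}_M)}{\hat\sigma_{M-1}}\right).$$

Thus if

$$n\delta \le \sum_{i=1}^{n}\rho\!\left(\frac{r_i(\hat{\boldsymbol\beta}_M)}{\hat\sigma_{M-1}}\right), \tag{5.44}$$

we may discard $\hat{\boldsymbol\beta}_M$ since $\hat\sigma_M \ge \hat\sigma_{M-1}$. Therefore $\hat\sigma$ is computed only for those
subsamples that do not satisfy condition (5.44).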
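
The screening rule replaces the iterative scale computation by a single sum for most candidates. The Python sketch below illustrates the idea with a bisquare $\rho$ ($c = 1.56$, $\delta = 0.5$) and a bisection-based M-scale; these choices, the artificial residual vectors and the function names are assumptions made for the example.

```python
import math

def rho(t):
    """Bisquare rho-function scaled to have maximum 1."""
    return 1.0 if abs(t) >= 1.0 else 1.0 - (1.0 - t * t) ** 3

def m_scale(res, c=1.56, delta=0.5):
    """Solve (1/n) sum rho(r_i / (c s)) = delta for s by bisection."""
    n = len(res)
    if sum(1 for r in res if r != 0.0) <= n * delta:
        return 0.0
    def mean_rho(s):
        return sum(rho(r / (c * s)) for r in res) / n
    hi = max(abs(r) for r in res)
    while mean_rho(hi) > delta:
        hi *= 2.0
    lo = hi
    while mean_rho(lo) < delta:
        lo /= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def can_discard(res_new, sigma_best, c=1.56, delta=0.5):
    """True when n * delta <= sum rho(r_i / sigma_best), so the new scale cannot be smaller."""
    total = sum(rho(r / (c * sigma_best)) for r in res_new)
    return total >= len(res_new) * delta

# Residuals of the best candidate so far, and of two hypothetical new candidates.
best = [0.3 * math.sin(i) for i in range(1, 21)]
sigma_best = m_scale(best)
worse = [3.0 * r for r in best]     # larger residuals: screened out without solving for the scale
better = [r / 3.0 for r in best]    # smaller residuals: the scale must be computed

for candidate in (worse, better):
    if can_discard(candidate, sigma_best):
        print("discarded without computing the scale")
    else:
        print("new scale:", m_scale(candidate))
```

Only candidates that pass the screen incur the cost of solving the M-scale equation, which is where most of the computing time would otherwise go.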

Although the $N$ given by (5.43) ensures that the approximate algorithm has the
desired BP in a probabilistic sense, it does not imply that it is a good approximation
to the exact estimator. Furthermore, because of the randomness of the subsampling
procedure, the resulting estimator is stochastic: repeating the computation
may lead to another local minimum and hence to a different $\hat\beta$. In our
experience, a carefully designed algorithm usually gives good results, and these
infrequent but unpleasant effects can be mitigated by increasing $N$ as much as the
available computing power will allow.

The subsampling procedure may be used to compute an approximate LMS
estimator. Since total lack of smoothness precludes any kind of iterative improvement,
the estimator is simply taken as the raw candidate $\hat\beta_k$ with smallest objective
function. Usually this is followed by one-step reweighting (Section 5.9.1), which,
besides improving the efficiency of the estimator, makes it more stable with respect
to the randomness of subsampling. It must be recalled, however, that the resulting
estimator is not asymptotically normal, and hence it is not possible to use it as a
basis for approximate inference on the parameters. Since the resulting estimator
is a weighted LS estimator, it would be intuitively attractive to apply classical LS
inference as if these weights were constant, but this procedure is not valid. As
explained in Section 5.9.1, reweighting does not improve on the estimator’s order of
convergence.


5.7.3 A strategy for faster subsampling-based algorithms 


With LTS or S-estimators, which allow iterative improvement steps as described in 
Section 5.7.1, it is possible to dramatically speed up the search for a global mini- 
mum. In the discussion below, “iteration” refers to one of the two iterative procedures 
described in that section (although the method to be described could be applied to 
any estimator that admits of iterative improvement steps). Consider the following 
two extreme strategies for combining the subsampling and the iterative parts of the 
minimization: 


A Use the “best” result (5.42) of the subsampling as a starting point from which to 
iterate until convergence to a local minimum. 

B Iterate to convergence from each of the $N$ candidates $\hat\beta_k$ and keep the result with
smallest $\hat\sigma$.


Clearly strategy B would yield a better approximation of the absolute minimum than
A, but is also much more expensive. An intermediate strategy, which depends on two
parameters $K_{\mathrm{iter}}$ and $K_{\mathrm{keep}}$, consists of the following steps:


1. For $k = 1,\ldots,N$, compute $\hat\beta_k$ and perform $K_{\mathrm{iter}}$ iterations, which yields the
candidates $\tilde\beta_k$ with residual scale estimators $\tilde\sigma_k$.

2. Only the $K_{\mathrm{keep}}$ candidates with the smallest scale estimators $\tilde\sigma_k$ are kept in storage,
which only needs to be updated when the current $\tilde\sigma_k$ is lower than at least one of the
current best $K_{\mathrm{keep}}$ values. Call these estimators $\tilde\beta_{(k)}$, $k = 1,\ldots,K_{\mathrm{keep}}$.

3. For $k = 1,\ldots,K_{\mathrm{keep}}$, iterate to convergence starting from $\tilde\beta_{(k)}$, obtaining the
candidate $\hat\beta_{(k)}$ with residual scale estimator $\hat\sigma_{(k)}$.

4. The final result is the candidate $\hat\beta_{(k)}$ with minimum $\hat\sigma_{(k)}$.


Option A above corresponds to $K_{\mathrm{iter}} = 0$ and $K_{\mathrm{keep}} = 1$, while B corresponds to
$K_{\mathrm{iter}} = \infty$ and $K_{\mathrm{keep}} = 1$. This strategy was first proposed by Rousseeuw and van Driessen
(2000) for the LTS estimator (Fast LTS) with $K_{\mathrm{iter}} = 2$ and $K_{\mathrm{keep}} = 10$. As mentioned
above, the general method can be used with any estimator that can be improved
iteratively.
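
The four steps can be written generically for any objective that admits an improvement step. The sketch below is an illustration only: `fast_search`, `step` and `objective` are hypothetical names, the candidates are plain numbers rather than regression fits, and the convergence test (stop when the objective no longer decreases by more than `tol`) is an assumption made here.

```python
import heapq


def iterate_to_convergence(x, step, objective, tol=1e-12, max_iter=10_000):
    """Apply `step` until the objective stops decreasing by more than `tol`."""
    f = objective(x)
    for _ in range(max_iter):
        x_new = step(x)
        f_new = objective(x_new)
        if f - f_new <= tol:
            return (x_new, f_new) if f_new < f else (x, f)
        x, f = x_new, f_new
    return x, f


def fast_search(starts, step, objective, k_iter=2, k_keep=10):
    """Steps 1-4: partial refinement, keep the best k_keep, refine fully, take the best."""
    def partially_refined():
        for x in starts:
            for _ in range(k_iter):            # step 1: only k_iter improvement steps
                x = step(x)
            yield objective(x), x

    # step 2: nsmallest keeps only k_keep candidates in storage while scanning
    kept = heapq.nsmallest(k_keep, partially_refined(), key=lambda pair: pair[0])
    # step 3: iterate each kept candidate to convergence
    finals = [iterate_to_convergence(x, step, objective) for _, x in kept]
    # step 4: the final result is the candidate with the smallest objective
    return min(finals, key=lambda pair: pair[1])


if __name__ == "__main__":
    f = lambda x: (x * x - 1.0) ** 2 + 0.3 * x                     # two local minima
    gd = lambda x: x - 0.05 * (4.0 * x * (x * x - 1.0) + 0.3)      # one gradient step
    print(fast_search([-2.0, -0.5, 0.5, 2.0], gd, f, k_iter=2, k_keep=2))
```

Setting `k_iter=0, k_keep=1` reproduces strategy A, while a very large `k_iter` with `k_keep=1` approaches strategy B.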

Salibian-Barrera and Yohai (2006) proposed a Fast S-estimator, based on this
strategy. A theoretical study of the properties of this procedure seems impossible, but
their simulations show that it is not worthwhile to increase $K_{\mathrm{iter}}$ and $K_{\mathrm{keep}}$ beyond
values of 1 and 10, respectively. They also show that $N = 500$ gives reliable results at
least for $p < 40$ and contamination fraction up to 10%. Their simulation also shows
that Fast S is better than Fast LTS with respect both to mean squared errors and the
probability of converging to a “wrong” solution. The simulation in Salibian-Barrera
and Yohai (2006) also indicates that Fast LTS works better with $K_{\mathrm{iter}} = 1$ than with
$K_{\mathrm{iter}} = 2$.

A further saving in time is obtained by replacing $\tilde\sigma$ in step 1 of the procedure by
an approximation obtained by one step of the Newton-Raphson algorithm starting
from the normalized median of absolute residuals.

Ruppert (1992) proposes a more complex random search method. However, the 
simulations by Salibian-Barrera and Yohai (2006) show that its behavior is worse 
than that of both the Fast S and the Fast LTS estimators. 


5.7.4 Starting values: the Peña-Yohai estimator


The MM-estimator requires an initial estimator, for which we have chosen an
S-estimator. Computing an S-estimator in turn requires initial values to start the
iterations, which forces the user to use subsampling. While this approach is practical
for small $p$, Table 5.3 shows that the number $N_{\mathrm{sub}}$ of subsamples required to obtain
good candidates with high probability increases rapidly with $p$, which makes the
procedure impractical for large values of $p$. Of course, one can employ smaller values
of $N_{\mathrm{sub}}$ than those given in the table, but then the estimator becomes unreliable. We
shall now show a different approach that yields much better results than subsampling.
It is based on a procedure initially proposed by Peña and Yohai (1999) for outlier
detection, but which in fact yields a fast and reliable initial regression estimator.

Let $(X, y)$ be a regression dataset, with $X \in R^{n\times p}$. Call $r_i$ the residuals from the
LS estimator, and for $j = 1,\ldots,n$ call $\hat\beta_{(j)}$ the LS estimator computed without
observation $j$, and call $r_{i(j)} = y_i - x_i'\hat\beta_{(j)}$ the $i$th residual using the LS estimator with
observation $j$ deleted. Then it can be shown that

$$r_{i(j)} - r_i = \frac{h_{ij}\, r_j}{1 - h_{jj}}, \qquad (5.45)$$

where $h_{ij}$ is the $(i,j)$ element of the “hat matrix” $H = X(X'X)^{-1}X'$, defined in (4.29);
see, for example, Belsley et al. (1980).
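
The leave-one-out identity (5.45) can be checked numerically. The sketch below does so for a simple regression with intercept, where the hat matrix has the closed form $h_{ij} = 1/n + (x_i - \bar x)(x_j - \bar x)/S_{xx}$; the function names and the restriction to one predictor are choices made here for brevity.

```python
def ls_fit(xs, ys):
    """Least squares intercept and slope for a simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def hat_element(xs, i, j):
    """Element (i, j) of the hat matrix for the design with columns (1, x)."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return 1.0 / n + (xs[i] - mx) * (xs[j] - mx) / sxx


def max_gap(xs, ys):
    """Largest discrepancy between refitted residuals r_i(j) and the shortcut (5.45)."""
    b0, b1 = ls_fit(xs, ys)
    r = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    gap = 0.0
    for j in range(len(xs)):
        a0, a1 = ls_fit(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])   # fit without observation j
        for i in range(len(xs)):
            direct = ys[i] - a0 - a1 * xs[i]
            shortcut = r[i] + hat_element(xs, i, j) * r[j] / (1.0 - hat_element(xs, j, j))
            gap = max(gap, abs(direct - shortcut))
    return gap


if __name__ == "__main__":
    print(max_gap([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]))
```

The point of the identity is that all $n^2$ deleted residuals are obtained from a single LS fit, without refitting $n$ times.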

Define the sensitivity vectors $r_i \in R^n$ with elements $r_{ij} = r_{i(j)} - r_i$, $i,j = 1,\ldots,n$. Then
$r_i$ expresses the sensitivity of the prediction of $y_i$ to the deletion of each observation.
We define the sensitivity matrix $R$ as the $n \times n$ matrix with rows $r_i'$. It follows from
(5.45) that $R = HW$, where $W$ is the diagonal matrix with elements $r_j/(1 - h_{jj})$. We
can consider $R$ as a data matrix with $n$ “observations” of dimension $n$. Then we
can try to find the most informative linear combinations of the columns of $R$ using
principal components. Let $v_1,\ldots,v_n$ be the eigenvectors of

$$P = R'R = WH^2W = WHW$$

corresponding to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$. Since $P$ has rank $p$, only the
first $p$ eigenvalues are nonnull. Then the only informative principal components are

$$z_i = Rv_i, \qquad i = 1,\ldots,p.$$
Peña and Yohai (1999) show that a more convenient way to compute the $z_i$s, up to a scale factor, is as

$$z_i = X(X'X)^{-1/2}u_i, \qquad i = 1,\ldots,p,$$

where $u_1,\ldots,u_p$ are the eigenvectors of the $p \times p$ matrix

$$Q = (X'X)^{-1/2}X'W^2X(X'X)^{-1/2}$$

corresponding to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$. Note that in this way we only
need to compute the eigenvectors of the $p \times p$ matrix $Q$ instead of those of the $n \times n$
matrix $P$. These “sensitivity principal components” $z_1,\ldots,z_p$ will be used to identify
outliers. A set of natural candidates for the initial estimator will be found using the LS
estimators obtained after deleting different sets of outliers found in this process. The
initial estimator for the MM-estimator will be the candidate whose residuals have a
minimum M-scale. In what follows, we describe in detail the iterative procedure that
will be used to obtain the initial estimator.

In each iteration $k$ we obtain a set $A_k$ of $3p + 1$ candidates for estimating $\beta$. Then,
in this iteration, we select the candidate $\hat\beta^{(k)}$ as

$$\hat\beta^{(k)} = \arg\min_{\beta \in A_k} S(y - X\beta)$$

and let

$$s^{(k)} = \min_{\beta \in A_k} S(y - X\beta),$$

where $S$ is an M-scale with breakdown point equal to a given $\alpha$. Now we will describe
how to compute the sets $A_k$. Call $z_{ji}$ ($i = 1,\ldots,n$) the coordinates of $z_j$ and put $m = [\alpha n]$.

Iteration 1 The set $A_1$ includes the LS estimator, the $L_1$ estimator and, for each
$j$, $1 \le j \le p$, it also includes the LS estimator obtained after deleting the
observations corresponding to the $m$ largest $z_{ji}$, the $m$ smallest $z_{ji}$, and the $m$ largest
$|z_{ji}|$.

Iteration k + 1 Suppose now that we have already completed iteration $k$. Then the
residuals $r^{(k)} = (r_1^{(k)},\ldots,r_n^{(k)})' = y - X\hat\beta^{(k)}$ are computed and all observations
$j$ such that

$$|r_j^{(k)}| > C s^{(k)}$$

are deleted, where $C$ is a given constant. The remaining observations are used to
compute new sensitivity principal components $z_1,\ldots,z_p$. The set $A_{k+1}$ is computed
using the same procedure as for $A_1$.

The iterations end when $\hat\beta^{(k+1)} = \hat\beta^{(k)}$. Ultimately, the initial estimator is defined
by $\tilde\beta = \hat\beta^{(k_0)}$, where

$$k_0 = \arg\min_k s^{(k)}.$$

The recommended values are $\alpha = 0.5$ and $C = 2$. Peña and Yohai (1999) show
that this estimator has breakdown point $\alpha$ for mass-point contamination, and
simulations in the same paper show it to yield reliable results under different outlier
configurations.
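
For one sensitivity principal component, the three deletion sets used in Iteration 1 can be obtained as in the following sketch. The function name `deletion_sets` is hypothetical, indices are zero-based, and ties are broken by position; the LS refits on the remaining observations and the M-scale comparison are not shown.

```python
def deletion_sets(z, m):
    """Indices of the m largest, m smallest and m largest-in-absolute-value coordinates of z."""
    idx = range(len(z))
    largest = sorted(idx, key=lambda i: z[i], reverse=True)[:m]
    smallest = sorted(idx, key=lambda i: z[i])[:m]
    largest_abs = sorted(idx, key=lambda i: abs(z[i]), reverse=True)[:m]
    return set(largest), set(smallest), set(largest_abs)


if __name__ == "__main__":
    z = [0.5, -3.0, 0.1, 2.0, -0.2]
    for deleted in deletion_sets(z, m=2):
        kept = [i for i in range(len(z)) if i not in deleted]
        print(sorted(deleted), kept)   # each `kept` set would feed one LS candidate
```

Applying this to each of the $p$ components gives the $3p$ deletion-based candidates of the iteration.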


5.7.5 Starting values with numeric and categorical predictors 


Consider a linear model of the form:

$$y_i = x_{1i}'\beta_1 + x_{2i}'\beta_2 + u_i, \qquad i = 1,\ldots,n, \qquad (5.46)$$

where the $x_{1i} \in R^{p_1}$ are 0-1 vectors, such as a model with some categorical variables
as in the case of the example in Section 1.4.2, and the $x_{2i} \in R^{p_2}$ are continuous random
variables. The presence of the continuous variables would make it appropriate to
use an MM-estimator, since a monotone M-estimator would not be robust. However,
the presence of the 0-1 variables is a source of possible difficulties with the initial
estimator. If we employ subsampling, in an unbalanced structured design, there is a
high probability that a subsampling algorithm yields collinear samples. For example,
if there are five independent explanatory dummy variables that take the value 1 with
probability 0.1, then the probability of selecting a noncollinear sample of size 5 is
only 0.011! In any event the subsampling will be a waste if $p_2 \ll p_1$. If we instead
employ the Peña-Yohai estimator, the result may be quite unreliable.

Our approach is based on the idea that if one knew $\beta_2$ (respectively $\beta_1$) in (5.46),
it would be natural to use a monotone M-estimator (S-estimator) for the parameter $\beta_1$
(the parameter $\beta_2$). To carry out this idea we recall the well-known procedure to
perform a bivariate regression through univariate regressions: first “orthogonalize” one
of the predictors with respect to the other, and then compute the univariate regression
of the response with respect to each of the predictors.

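Let $M(X, y)$ be a monotone regression M-estimator such as the $L_1$ estimator. The
first step is to “remove the effect of $X_1$ from $X_2$ and $y$”. Let

$$\tilde X_2 = X_2 - X_1 T \quad \text{and} \quad \tilde y = y - X_1 t, \qquad (5.47)$$

where $t = M(X_1, y)$ and the columns of $T \in R^{p_1 \times p_2}$ are the regression vectors of the
columns of $X_2$ on $X_1$.

Let $\tilde\beta_2 = \arg\min_{\beta_2} S(\tilde y - \tilde X_2\beta_2 - X_1\tilde\beta_1(\beta_2))$, where $\tilde\beta_1(\beta_2) = M(X_1, \tilde y - \tilde X_2\beta_2)$.
In other words, $\tilde\beta_2$ is the S-estimator obtained after adjusting for the 0-1 variables.
The estimators $(\hat\beta_1, \hat\beta_2)$ are obtained by “back-transforming” from (5.47), so
that $X_1\hat\beta_1 + X_2\hat\beta_2 = X_1\tilde\beta_1 + \tilde X_2\tilde\beta_2$, which yields $\hat\beta_2 = \tilde\beta_2$ and $\hat\beta_1 = \tilde\beta_1 - T\tilde\beta_2$.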

To compute $\tilde\beta_2$ above we use Peña-Yohai candidates based on $\tilde X_2$ and $\tilde y$. For each
of these candidates, we compute $\tilde\beta_1$ as above and select the candidate that yields
the smallest M-scale of the residuals. In our numerical experiments we found that
it is useful to further refine this best Peña-Yohai candidate employing the iterations
described in Section 5.7.1.1. Finally, we use $(\hat\beta_1, \hat\beta_2)$ and its associated residual scale
estimator as the initial point to compute an MM-estimator.
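
The back-transformation from (5.47) is a matter of simple matrix algebra. The sketch below, with hypothetical helper names and plain Python lists in place of a matrix library, only checks the identity $X_1\hat\beta_1 + X_2\hat\beta_2 = X_1\tilde\beta_1 + \tilde X_2\tilde\beta_2$ on a toy design; the values of $T$, $\tilde\beta_1$ and $\tilde\beta_2$ are arbitrary and not the output of any estimator.

```python
def matvec(A, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def adjust(X1, X2, T):
    """X2_tilde = X2 - X1 T, as in (5.47)."""
    return [[x2 - sum(a * T[k][c] for k, a in enumerate(row1))
             for c, x2 in enumerate(row2)]
            for row1, row2 in zip(X1, X2)]


def back_transform(beta1_tilde, beta2_tilde, T):
    """beta1_hat = beta1_tilde - T beta2_tilde, beta2_hat = beta2_tilde."""
    shift = matvec(T, beta2_tilde)
    return [b - s for b, s in zip(beta1_tilde, shift)], list(beta2_tilde)


def fitted(X1, b1, X2, b2):
    return [u + v for u, v in zip(matvec(X1, b1), matvec(X2, b2))]


if __name__ == "__main__":
    X1 = [[1, 0], [1, 0], [0, 1], [0, 1]]          # 0-1 predictors
    X2 = [[0.5], [1.5], [2.0], [4.0]]              # continuous predictor
    T = [[1.0], [3.0]]                             # arbitrary illustrative values
    b1_hat, b2_hat = back_transform([2.0, -1.0], [0.7], T)
    print(fitted(X1, b1_hat, X2, b2_hat))
    print(fitted(X1, [2.0, -1.0], adjust(X1, X2, T), [0.7]))
```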

Maronna and Yohai (2000) proposed a similar idea, but employing subsampling
for the S-estimator. Koller (2012) proposed a subsampling method for these models
that avoids collinear subsamples. A naive procedure would be to just apply the
Peña-Yohai estimator to $X$. These alternatives were compared to the one proposed
here by a thorough simulation study, which demonstrated that the proposed method
outperforms its competitors.

Rousseeuw and Wagner (1994), and Hubert and Rousseeuw (1996, 1997) have 
also proposed other approaches for models with categorical predictors. 


Example 5.4 Each of the 90 rows of a dataset (from Hettich and Bay, 1999) is a set of
measurements at a river in Europe. There are 11 predictors. The first three are
categorical: the season of the year, river size (small, medium and large) and fluid velocity
(low, medium and high). The other 8 are the concentrations of chemical substances.
The response is the logarithm of the abundance of a certain class of algae.
The tables and figures for this example are obtained with script algae.R.


Figures 5.14 and 5.15 are the normal Q-Q plots of the residuals corresponding
to the LS estimator and to the MS-estimator described above. The first gives the
impression of short-tailed residuals, while the residuals from the robust fit indicate
the existence of at least two outliers.


Figure 5.14 Algae data: normal Q-Q plot of LS residuals (least squares residuals against theoretical quantiles)


Figure 5.15 Algae data: normal Q-Q plot of robust residuals (robust residuals against theoretical quantiles)


Example 1.3 (continued) In the multiple linear regression in Section 1.4.2, the 
response variable was the rate of unemployment and the predictor variables were PA, 
GPA, HS, GHS, Region and Period. The last two are categorical variables with 22 and 
2 parameters respectively, while the other predictors are continuous variables. The 
estimator used for that example was the MS-estimator. Figures 1.4 and 1.5 revealed 
that for these data the LS estimator found no outliers at all, while the MS-estimator 
found a number of large outliers. In this example three of the LS and MS-estimator 
t-statistics and p-values give opposite results using 0.05 as the level of the test, as 
shown in Table 5.4. 


Table 5.4 Unemployment data: results from LS and MS estimators

Variable     Estimator    t value    p value
Region 20    MS           -1.0944    0.2811
             LS           -3.0033    0.0048
HS           MS            1.3855    0.1744
             LS            2.4157    0.0209
Period2      MS            2.1313    0.0400
             LS            0.9930    0.3273


For the “Region 20” level of the “Region” categorical variable and the HS
variable, the LS fit declares these variables as significant while the robust fit
declares them insignificant. The opposite is the case for the Period 2 level of the
Period categorical variable. This shows that outliers can have a large influence on
the classical test statistics of an LS fit.


5.7.6 Comparing initial estimators 


We shall now compare two choices for the initial estimator of an MM-estimator with
bisquare $\rho$. The first starts from an S-estimator computed with the strategy described
in Section 5.7.2 and with $N_{\mathrm{sub}}$ subsamples (henceforth “S-Sub” for brevity). The
second is to start from the Peña-Yohai estimator (henceforth “P-Y”).

For large $p$, the choice of the number of subsamples $N_{\mathrm{sub}}$ for the former has to be
somewhat arbitrary, for if we wanted to ensure even a breakdown point of 0.15, the
values of $N_{\mathrm{sub}}$ given by Table 5.3 for $p > 40$ would make the computation unfeasible.

Table 5.5 compares the computing times (in seconds) of both estimators for $p$
between 10 and 100 and $n = 10p$. It is seen that P-Y is much faster, and the
difference between the estimators increases with $p$, despite the relatively low values
of $N_{\mathrm{sub}}$.

The left-hand half of Table 5.6 shows the finite-sample efficiencies of both estimators.
It is seen that those of MM based on P-Y are always higher than those obtained
from the S-estimator, and are reasonably close to the nominal one; the differences
between both estimators are clearer when $n = 5p$. The last two columns show the
respective maximum MSEs under 10% point-mass contamination, and it is seen that
employing P-Y yields much lower values. In conclusion, using P-Y as initial
estimator makes MM faster, more efficient and more robust. Another important feature
of P-Y is that unlike subsampling it is deterministic, and therefore yields totally
reproducible results. For all these reasons we recommend its routine use instead of
subsampling.


Table 5.5 Computing times (in seconds) for the bisquare MM-estimator with S-Sub and P-Y starting points

  p       n     N_sub     S-Sub      P-Y
 10     100     1000       0.01     0.01
 20     200     2000       0.06     0.05
 30     300     3000       0.58     0.16
 50     500     4000       7.68     0.71
 80     800     4000      37.32     3.33
100    1000     4000     114.33     7.74


Table 5.6 Efficiencies and Max MSEs of MM-estimator with S-Sub and P-Y starting points

                 Efficiencies         Max MSE
 p       n      S-Sub     P-Y      S-Sub     P-Y
10      50      0.727    0.827     1.63     0.98
       100      0.808    0.855     0.71     0.57
       200      0.838    0.857     0.47     0.44
20     100      0.693    0.834     1.51     0.96
       200      0.789    0.849     0.71     0.60
       400      0.825    0.851     0.49     0.45
50     250      0.682    0.862     2.10     1.06
       500      0.791    0.859     0.77     0.52
      1000      0.828    0.856     0.62     0.42
80     400      0.685    0.874     5.10     0.96
       800      0.791    0.863     3.22     0.64
      1600      0.827    0.857     2.41     0.49


5.8 Balancing asymptotic bias and efficiency 


Defining the asymptotic bias of regression estimators requires a measure of the “size”
of the difference between the value of an estimator, which for practical purposes
we take to be the asymptotic value $\hat\beta_\infty$, and the true parameter value $\beta$. We shall
use an approach based on prediction. Consider an observation $(x, y)$ from the model
(5.1)-(5.2):

$$y = x'\beta + u, \qquad x \text{ and } u \text{ independent}.$$

The prediction error corresponding to $\hat\beta_\infty$ is

$$e = y - x'\hat\beta_\infty = u - x'(\hat\beta_\infty - \beta).$$

Let $Eu^2 = \sigma^2 < \infty$, $Eu = 0$, and $V_x = Exx'$. Then the mean squared prediction error is

$$Ee^2 = \sigma^2 + (\hat\beta_\infty - \beta)'V_x(\hat\beta_\infty - \beta).$$

The second term is a measure of the increase in the prediction error due to the
parameter estimation bias, and so we define the bias as

$$b(\hat\beta_\infty) = \sqrt{(\hat\beta_\infty - \beta)'V_x(\hat\beta_\infty - \beta)}. \qquad (5.48)$$

Note that if $\hat\beta$ is regression, scale and affine equivariant, this measure is invariant
under the respective transformations; that is, $b(\hat\beta_\infty)$ does not change when any
of those transformations is applied to $(x, y)$. If $V_x$ is a multiple of the identity
(such as when the elements of $x$ are i.i.d. zero mean normal) then (5.48) is a
multiple of $\|\hat\beta_\infty - \beta\|$, so in this special case the Euclidean norm is an adequate bias
measure.

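Now consider a model with intercept; that is,

$$x = \begin{bmatrix} 1 \\ \underline{x} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix},$$

and let $\mu = E\underline{x}$ and $U = E\underline{x}\,\underline{x}'$, so that

$$V_x = \begin{bmatrix} 1 & \mu' \\ \mu & U \end{bmatrix}.$$

For a regression and affine equivariant estimator, there is no loss of generality in
assuming $\mu = 0$ and $\beta = 0$, and in this case

$$b(\hat\beta_\infty)^2 = \hat\beta_{0,\infty}^2 + \hat\beta_{1,\infty}'U\hat\beta_{1,\infty},$$

with the first term representing the contribution to bias of the intercept and the second
that of the slopes.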

A frequently used benchmark for comparing estimators is to assume that the joint
distribution of $(x, y)$ belongs to a contamination neighborhood of a multivariate
normal. By the affine and regression equivariance of the estimators, there is no loss of
generality in assuming that this central normal distribution is $N_{p+1}(0, I)$. In this case it
can be shown that the maximum biases of M-estimators do not depend on $p$. A proof
is outlined in Section 5.13.5. The same is true of the other estimators treated in this
chapter, except for GM-estimators, as set out in Section 5.11.1.

The maximum asymptotic bias of S-estimators can be derived from the results of
Martin et al. (1989), and those of the LTS and LMS estimators from Berrendero and
Zamar (2001). Table 5.7 compares the maximum asymptotic biases of LTS, LMS
and the S-estimator with bisquare scale and three MM-estimators with bisquare $\rho$,
in all cases with asymptotic BP equal to 0.5, when the joint distribution of $x$ and $y$
is in an $\varepsilon$-neighborhood of the multivariate normal $N_p(0, I)$. One MM-estimator
is given by the global minimum of (5.29) with normal distribution efficiency 0.95.
The other two MM-estimators correspond to local minima of (5.29) obtained using
the IRWLS algorithm starting from the S-estimator, with efficiencies 0.85 and 0.95.
The LMS estimator has the smallest bias for all the values of $\varepsilon$ considered, but also
has zero asymptotic efficiency. It is remarkable that the maximum biases of both
“local” MM-estimators are much lower than those of the “global” MM-estimator, and
close to the maximum biases of the LMS estimator. This shows the importance of a
good starting point. The fact that an estimator obtained as a local minimum starting
from a very robust estimator may have a lower bias than one defined by the absolute
minimum was pointed out by Hennig (1995), who also gave bounds for the bias of
MM-estimators with general $\rho$-functions in contamination neighborhoods.


Table 5.7 Maximum bias of regression estimators for contamination $\varepsilon$

                        ε
               0.05    0.10    0.15    0.20    Eff.
LTS            0.63    1.02    1.46    2.02    0.07
S-E            0.56    0.88    1.23    1.65    0.29
LMS            0.53    0.83    1.13    1.52    0.0
MM (global)    0.78    1.24    1.77    2.42    0.95
MM (local)     0.56    0.88    1.23    1.65    0.95
MM (local)     0.56    0.88    1.23    1.65    0.85

It is also curious that the two local MM-estimators with different efficiencies
have the same maximum biases. To understand this phenomenon, we show in
Figure 5.16 the asymptotic biases of the S- and MM-estimators for contamination
fraction $\varepsilon = 0.2$ and point contamination located at $(x_0, Kx_0)$ with $x_0 = 2.5$, as a
function of the contamination slope $K$. It is seen that the bias of each estimator is
worse than that of the LS estimator up to a certain value of $K$ and then drops to zero.
But the range of values where the MM-estimator with efficiency 0.95 has a larger
bias than the LS estimator is greater than those for the 0.85 efficient MM-estimator
and the S-estimator. This is the price paid for a higher normal efficiency. The
MM-estimator with efficiency 0.85 is closer in behavior to the S-estimator than
the one with efficiency 0.95. If one makes similar plots for other values of $x_0$, one
finds that for $x_0 > 5$ the curves for the S-estimator and the two MM-estimators are
very similar.


Figure 5.16 Biases of LS, S-estimator, and MM-estimators with efficiencies 0.85
and 0.95, as a function of the contamination slope, for $\varepsilon = 0.2$, when $x_0 = 2.5$ (bias against slope)

The preceding results on MM-estimators suggest a general approach for the choice
of robust estimators. Consider in general an estimator $\hat\theta$ with high efficiency, defined
by the absolute minimum of a target function. We have an algorithm that, starting
from any initial value $\theta_0$, yields a local minimum of the target function, that we shall
call $A(\theta_0)$. Assume also that we have an estimator $\theta^*$ with lower bias, although
possibly with low efficiency. Define a new estimator $\tilde\theta$ (call it the “approximate estimator”)
as the local minimum of the target function obtained by applying the algorithm
starting from $\theta^*$; that is, $\tilde\theta = A(\theta^*)$. Then, in general, $\tilde\theta$ has the same efficiency as $\hat\theta$
under the model, while it has a lower bias than $\hat\theta$ in contamination neighborhoods.
If, in addition, $\theta^*$ is fast to compute, then $\tilde\theta$ will also be faster than $\hat\theta$. An instance of
this approach in multivariate analysis will be seen in Section 6.8.5.


5.8.1 “Optimal” redescending M-estimators 


In Section 3.5.4 we gave the solution to the Hampel dual problems for a 
one-dimensional parametric model, namely: 


• finding the M-estimator minimizing the asymptotic variance subject to an upper
bound on the gross-error sensitivity (GES), and
• minimizing the GES subject to an upper bound on the asymptotic variance.


This approach cannot be taken with regression M-estimators with random predictors 
since (5.13) implies that the GES is infinite. However, as we now show, it is possible 
to modify it in a suitable way. 

Consider a regression estimator $\hat\beta = (\hat\beta_0, \hat\beta_1)$, where $\hat\beta_0$ corresponds to the
intercept and $\hat\beta_1$ to the slopes. Yohai and Zamar (1997) showed that for an M-estimator $\hat\beta$
with bounded $\rho$, the maximum biases $MB(\varepsilon, \hat\beta_0)$ and $MB(\varepsilon, \hat\beta_1)$ in an $\varepsilon$-contamination
neighborhood are of order $\sqrt{\varepsilon}$. Therefore, the biases of these estimators are
continuous at zero, which means that a small amount of contamination produces only a
small change in the estimator. Because of this, the approach in Section 3.5.4 can be
adapted to the present situation by replacing the GES with a different measure called
the contamination sensitivity (CS), which is defined as

$$CS(\hat\beta_j) = \lim_{\varepsilon \to 0} \frac{MB(\varepsilon, \hat\beta_j)}{\sqrt{\varepsilon}} \qquad (j = 0, 1).$$

Recall that the asymptotic covariance matrix of a regression M-estimator depends
on $\rho$ only through

$$v(\psi, F) = \frac{E_F(\psi(u)^2)}{(E_F\psi'(u))^2},$$

where $\psi = \rho'$ and $F$ is the error distribution. We consider only the slopes $\hat\beta_1$, which
are usually more important than the intercept. The analogues of the direct and dual
Hampel problems can now be stated as finding the function $\psi$ that


• minimizes $v(\psi, F)$ subject to the constraint $CS(\hat\beta_1) \le k_1$,
or
• minimizes $CS(\hat\beta_1)$ subject to $v(\psi, F) \le k_2$,


where $k_1$ and $k_2$ are given constants.
Yohai and Zamar (1997) found that the optimal $\psi$ for both problems has the form:

$$\psi_c(u) = \mathrm{sgn}(u)\left(-\frac{\varphi'(|u|) + c}{\varphi(|u|)}\right)^{+}, \qquad (5.49)$$

where $\varphi$ is the standard normal density, $c$ is a constant and $t^{+} = \max(t, 0)$ denotes the
positive part of $t$. For $c = 0$ we have the LS estimator: $\psi(u) = u$.

Table 5.8 gives the values of $c$ corresponding to different efficiencies, and
Figure 5.17 shows the bisquare and optimal $\psi$-functions with efficiency 0.95. We
observe that the optimal $\psi$ increases almost linearly and then redescends much faster
than the bisquare $\psi$. This optimal $\psi$-function is a smoothed, differentiable version
of the hard-rejection function $\psi(u) = u\,I(|u| \le a)$ for some constant $a$. As such, it is
not only good from the numerical optimization perspective, but also has the intuitive
feature of making a rather rapid transition from its maximum absolute values to zero
in the “flanks” of the nominal normal distribution. The latter is a region in which it is
most difficult to tell whether a data point is an outlier or not: outside that transition
region, outliers are clearly identified and rejected, and inside it, data values are left
essentially unaltered. As a minor point, the reader should note that (5.49) implies
that the optimal $\psi$ has the curious feature of vanishing completely in a small interval
around zero. For example, if $c = 0.013$ the interval is $(-0.032, 0.032)$, which is so
small it is not visible in the figure.

Svarc et al. (2002) considered the two optimization problems stated above, but
used the actual maximum bias $MB(\varepsilon, \hat\beta_1)$ for a range of positive values of $\varepsilon$ instead of
the approximation given by the contamination sensitivity $CS(\hat\beta_1)$. They calculated the
optimal $\psi$ and showed numerically that for $\varepsilon \le 0.20$ it is almost identical to the one
based on the contamination sensitivity. Therefore the optimal solution corresponding
to an infinitesimal contamination is a good approximation to the one corresponding
to $\varepsilon > 0$, at least for $\varepsilon \le 0.20$.


Table 5.8 Constants for optimal estimator

Efficiency    0.80     0.85     0.90     0.95
c             0.060    0.044    0.028    0.013


Figure 5.17 Optimal (—) and bisquare (....) psi-functions with efficiency 0.95
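
Since $\varphi'(t) = -t\varphi(t)$, the optimal $\psi$ in (5.49) can be written as $\mathrm{sgn}(u)\,(|u| - c/\varphi(|u|))^{+}$, which is simple to evaluate. The following sketch uses hypothetical function names and adds a guard for the case where the normal density underflows to zero; it illustrates the formula and is not the RobStatTM implementation.

```python
import math


def normal_pdf(t: float) -> float:
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def psi_optimal(u: float, c: float) -> float:
    """psi_c(u) = sgn(u) * max(|u| - c / phi(|u|), 0), equivalent to (5.49)."""
    if c == 0.0:
        return u                       # LS case: psi(u) = u
    a = abs(u)
    phi = normal_pdf(a)
    if phi == 0.0:                     # density underflow far in the tails
        return math.copysign(0.0, u)
    return math.copysign(max(a - c / phi, 0.0), u)


if __name__ == "__main__":
    for u in (0.03, 0.035, 1.0, 2.0, 3.0, 4.0):
        print(u, psi_optimal(u, 0.013))
```

With $c = 0.013$ the function vanishes for $|u|$ below roughly $0.013/\varphi(0) \approx 0.033$, in agreement with the small interval around zero mentioned in the text.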

The results of simulations in the next section show an unexpected property of
this estimator, namely that under a very high nominal efficiency, such as 0.99, its
behavior under contamination can be approximately as good as that of the bisquare
estimator with a much lower efficiency. This fact can be explained by the form of its
$\psi$-function, as shown in Figure 5.17, which coincides with that of the LS estimator
up to a certain point, and then drops to zero.


5.9 Improving the efficiency of robust regression estimators


5.9.1 Improving efficiency with one-step reweighting 


We have seen that estimators based on a robust scale cannot have both a high BP
and high normal efficiency. As we have already discussed, one can obtain a desired
normal efficiency by using an S-estimator as the starting point for an iterative
procedure leading to an MM-estimator. In this section we consider a simpler alternative
procedure, proposed by Rousseeuw and Leroy (1987), to increase the efficiency of
an estimator $\hat\beta_0$ without decreasing its BP.

Let $\hat\sigma$ be a robust scale of $r(\hat\beta_0)$, say, the normalized median of absolute values
(4.47). Then compute a new estimator $\hat\beta$, defined as the weighted LS estimator of
the dataset with weights $w_i = W(r_i(\hat\beta_0)/\hat\sigma)$, where $W(t)$ is a decreasing function of
$|t|$. Rousseeuw and Leroy proposed that the “weight function” $W$ be chosen as the “hard
rejection” function $W(t) = I(|t| \le k)$, with $k$ equal to a $\gamma$-quantile of the distribution
of $|x|$, where $x$ has a standard normal distribution; for example, $\gamma = 0.975$. Under
normality, this amounts to discarding a proportion of about $1 - \gamma$ of the points with
largest absolute residuals.

He and Portnoy (1992) show that, in general, such reweighting methods
preserve the order of consistency of $\hat\beta_0$, so in the standard situation where $\hat\beta_0$
is $\sqrt{n}$-consistent, then so is $\hat\beta$. Unfortunately, this means that because the LMS
estimator is $n^{1/3}$-consistent, so is the reweighted LMS estimator.

In general, the reweighted estimator $\hat\beta$ is more efficient than $\hat\beta_0$, but its asymptotic
distribution is complicated (and more so when $W$ is discontinuous), and this makes it
difficult to tune it for a given efficiency; in particular, it has to be noted that choosing
$\gamma = 0.95$ for hard rejection does not make the asymptotic efficiency of $\hat\beta$ equal to 0.95:
it continues to be zero. A better approach for increasing the efficiency is described in
Section 5.9.2.
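
The hard-rejection weights described above can be sketched as follows. The function names are hypothetical, the scale is taken to be the median of the absolute residuals divided by $\Phi^{-1}(0.75)$ (an assumption about the normalization in (4.47)), and the weighted LS fit itself is not shown.

```python
from statistics import NormalDist, median


def rejection_threshold(gamma: float = 0.975) -> float:
    """gamma-quantile of |x| for x standard normal: Phi^{-1}((1 + gamma) / 2)."""
    return NormalDist().inv_cdf((1.0 + gamma) / 2.0)


def hard_rejection_weights(residuals, gamma: float = 0.975):
    """Weights w_i = I(|r_i| / sigma <= k) with sigma the normalized median of |r_i|."""
    sigma = median(abs(r) for r in residuals) / NormalDist().inv_cdf(0.75)
    k = rejection_threshold(gamma)
    return [1.0 if abs(r) / sigma <= k else 0.0 for r in residuals]


if __name__ == "__main__":
    r = [0.2, -0.5, 0.1, 0.7, -0.3, 9.0, -0.4, 0.6]
    print(rejection_threshold(), hard_rejection_weights(r))
```

Observations with weight zero are simply left out of the subsequent LS fit.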


5.9.2 A fully asymptotically efficient one-step procedure 


None of the estimators discussed so far can achieve full efficiency at the normal
distribution and at the same time have a high BP and small maximum bias. We now
discuss an adaptive one-step estimation method due to Gervini and Yohai (2002),
which attains full asymptotic efficiency at the normal error distribution and at the
same time has a high BP and small maximum bias. It is a weighted LS estimator
computed from an initial estimator $\hat\beta_0$ with high BP, but rather than deleting the
values larger than a fixed $k$, the procedure will keep a number $N$ of observations $(x_i, y_i)$
corresponding to the smallest values of $t_i = |r_i(\hat\beta_0)|/\hat\sigma$, $i = 1,\ldots,n$, where $N$ depends
on the data, as will be described below. This $N$ has the property that in large samples
under normality it will have $N/n \to 1$, so that a vanishing fraction of data values will
be deleted and full efficiency will be obtained.

Call $G$ the distribution function of the absolute errors $|u_i|/\sigma$ under the normal
model; that is,

$$G(t) = 2\Phi(t) - 1 = P(|x| \le t),$$

with $x \sim \Phi$, which is the standard normal distribution function. Let $t_{(1)} \le \ldots \le t_{(n)}$
denote the order statistics of the $t_i$. Let $\eta = G^{-1}(\gamma)$, where $\gamma$ is a large value such as
$\gamma = 0.95$. Define

$$i_0 = \min\{i : t_{(i)} \ge \eta\}, \qquad q = \min_{i \ge i_0}\left(\frac{i-1}{G(t_{(i)})}\right), \qquad (5.50)$$

and

$$N = [q] \qquad (5.51)$$

where $[\cdot]$ denotes the integer part. The one-step estimator is the LS estimator of the
observations corresponding to $t_{(i)}$ for $i \le N$.

We now justify this procedure. The intuitive idea is to consider as potential
outliers only those observations whose $t_{(i)}$ are not only greater than a given value, but
also sufficiently larger than the corresponding order statistic of a sample from $G$.
Note that if the data contain one or more outliers, then in a normal Q-Q plot of the
$t_{(i)}$ against the respective quantiles of $G$, some large $t_{(i)}$ will appear well above the
identity line, and we would delete it and all larger ones. The idea of the proposed
procedure is to delete observations with large $t_{(i)}$ until the Q-Q plot of the remaining
ones remains below the identity line, at least for large values of $t_{(i)}$. Since we are
interested only in the tails of the distribution, we consider only values larger than some
given $\eta$.

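More precisely, for $N \le n$, call $G_N$ the empirical distribution function of
$t_{(1)} \le \cdots \le t_{(N)}$:

$$G_N(t) = \frac{1}{N}\,\#\{t_{(i)} \le t\}.$$

It follows that

$$G_N(t) = \frac{i-1}{N} \quad \text{for } t_{(i-1)} \le t < t_{(i)},$$

and hence each $t$ in the half-open interval $[t_{(i-1)}, t_{(i)})$ is an $\alpha_i$-quantile of $G_N$ with

$$\alpha_i = \frac{i-1}{N}.$$

The $\alpha_i$-quantile of $G$ is $G^{-1}(\alpha_i)$. Then we look for $N$ such that for $i_0 \le i \le N$ the
$\alpha_i$-quantile of $G_N$ is not larger than that of $G$; that is,

$$\text{for } i \in [i_0, N]: \quad t_{(i-1)} \le t < t_{(i)} \;\Longrightarrow\; t \le G^{-1}\left(\frac{i-1}{N}\right) \iff G(t) \le \frac{i-1}{N}. \tag{5.52}$$

Since $G$ is continuous, (5.52) implies that

$$G(t_{(i)}) \le \frac{i-1}{N} \quad \text{for } i_0 \le i \le N. \tag{5.53}$$

Also, since

$$i > N \;\Longrightarrow\; \frac{i-1}{N} \ge 1 > G(t) \quad \text{for all } t,$$

the restriction $i \le N$ may be dropped in (5.53), which can be seen to be equivalent to

$$N \le \frac{i-1}{G(t_{(i)})} \quad \text{for } i \ge i_0 \iff N \le q, \tag{5.54}$$

with $q$ defined in (5.50). We want the largest $N \le q$, and since $N$ is an integer, we
ultimately get (5.51).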

Gervini and Yohai show that under very general assumptions on $\hat{\boldsymbol{\beta}}_0$ and $\hat{\sigma}$, and
regardless of the consistency rate of $\hat{\boldsymbol{\beta}}_0$, these estimators attain the maximum BP and
full asymptotic efficiency for normally distributed errors.
5.9.3 Improving finite-sample efficiency and robustness


Most theoretical results on the efficiency of robust estimators deal with their asymp- 
totic efficiency — that is, the ratio between the asymptotic variances of the maximum 
likelihood estimator (MLE) and of the robust estimator — because it is much easier 
to deal with than the finite-sample efficiency, defined as the ratio between the mean 
squared errors (MSE) of the MLE and of the robust estimator, which is the one that 
really matters. However, unless the sample size is large enough, the finite-sample 
efficiency may be smaller than the asymptotic one. The last section dealt with a pro- 
cedure for improving the asymptotic efficiency. Here we present a general approach 
proposed by Maronna and Yohai (2014), called distance-constrained maximum
likelihood (DCML), which yields both a high finite-sample efficiency and high
robustness even for small n, and which may be applied to any parametric model.

Maronna and Yohai (2010) and Koller and Stahel (2011) have proposed partial 
solutions for this problem. A proposal by Bondell and Stefanski (2013) yields very 
high efficiencies, but at the cost of a serious loss of robustness. 


5.9.3.1 Measuring finite-sample efficiency and robustness 


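Before presenting the DCML estimator, some definitions are necessary. The MSE of
a univariate estimator was defined in (2.3). We now have to extend this concept to
a regression estimator $\hat{\boldsymbol{\beta}} \in R^p$. Consider a sample $Z = \{(\mathbf{x}_i, y_i), i = 1, \ldots, n\}$ from the
linear model (5.1) $y_i = \mathbf{x}_i'\boldsymbol{\beta}_0 + u_i$, with $u_i$ independent of $\mathbf{x}_i$ and $\mathrm{E}u_i = 0$. The MSE
of an estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}_0$, based on $Z$, is defined as

$$\mathrm{MSE}(\hat{\boldsymbol{\beta}}) = \mathrm{E}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\mathbf{V}_{\mathbf{x}}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0),$$

where $\mathbf{V}_{\mathbf{x}} = \mathrm{E}\mathbf{x}\mathbf{x}'$.

This definition is based on a prediction approach. Consider an observation $\mathbf{z}_0 =
(\mathbf{x}_0, y_0)$, independent of $Z$, following the same model; that is, $y_0 = \mathbf{x}_0'\boldsymbol{\beta}_0 + u_0$. The
prediction error of $y_0$ using $\hat{\boldsymbol{\beta}}$ is

$$e = y_0 - \mathbf{x}_0'\hat{\boldsymbol{\beta}} = u_0 - \mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0).$$

Put $\sigma^2 = \mathrm{E}u^2$ and $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. Then the mean squared prediction error is

$$\mathrm{E}e^2 = \sigma^2 + \mathrm{E}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\mathbf{x}_0\mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) - 2\mathrm{E}u_0\mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0).$$

The last term vanishes because of the independence of $u_0$ from the other elements.
As to the second term, the independence between $X$ and $\mathbf{x}_0$ yields

$$\mathrm{E}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\mathbf{x}_0\mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) = \mathrm{E}\{\mathrm{E}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\mathbf{x}_0\mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \mid X]\} = \mathrm{E}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\mathbf{V}_{\mathbf{x}}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0),$$

and therefore

$$\mathrm{E}e^2 = \sigma^2 + \mathrm{E}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\mathbf{V}_{\mathbf{x}}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0).$$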
The first term is the same for all estimators, while the second can be considered
as a squared norm of the difference between the estimator and the true value, with
respect to the quadratic form $\mathbf{V}_{\mathbf{x}}$. This norm is a more suitable measure of that
difference than the Euclidean norm. In particular it is invariant, in the sense that if we
replace $\mathbf{x}_i$ by $\mathbf{T}\mathbf{x}_i$, where $\mathbf{T}$ is any nonsingular matrix, then the MSE remains the same.

Then the finite-sample efficiency of $\hat{\boldsymbol{\beta}}$ is defined as

$$\mathrm{eff}(\hat{\boldsymbol{\beta}}) = \frac{\mathrm{MSE}(\hat{\boldsymbol{\beta}}_{\mathrm{MLE}})}{\mathrm{MSE}(\hat{\boldsymbol{\beta}})}. \tag{5.55}$$


While the asymptotic bias and variances can be in some cases calculated analyt- 
ically, the finite-sample efficiency has to be computed through simulation. 
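
As an illustration of how such a simulation might look, the following Python sketch estimates the MSE in the $\mathbf{V}_{\mathbf{x}}$ norm for the LS estimator only. It rests on several assumptions that are not part of the text: the helper names `mse_vx` and `simulate_ls_mse` are hypothetical, the model has no intercept, the predictors are standard normal (so that $\mathbf{V}_{\mathbf{x}} = \mathbf{I}$) and the errors are normal. A finite-sample efficiency as in (5.55) would be the ratio of two such estimates, one for the MLE and one for the robust estimator.

```python
import numpy as np


def mse_vx(beta_hats, beta0, Vx):
    """Monte Carlo estimate of E (b - beta0)' Vx (b - beta0) from replicated fits."""
    d = np.asarray(beta_hats) - beta0
    return float(np.mean(np.einsum("ij,jk,ik->i", d, Vx, d)))


def simulate_ls_mse(n, p, sigma=1.0, reps=2000, seed=0):
    """Simulated MSE of LS under normal predictors and errors (illustrative setup)."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)                      # true parameter; LS is equivariant, so 0 is enough
    fits = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))      # assumption: Vx = E xx' = identity
        y = X @ beta0 + sigma * rng.standard_normal(n)
        fits.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return mse_vx(fits, beta0, np.eye(p))


est = simulate_ls_mse(n=30, p=3, reps=2000, seed=1)
# For this Gaussian design the exact value is sigma^2 * p / (n - p - 1).
print(est, 3 / (30 - 3 - 1))
```

Replacing the `lstsq` call by a robust fit and dividing the two simulated MSEs would give a simulated efficiency for that estimator.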

We now deal with measuring robustness. Call $\mathrm{MSE}_F(\hat{\boldsymbol{\beta}})$ the MSE when the joint
distribution of $(\mathbf{x}_i, y_i)$ is $F$. Given a “central model” $F_0$ (say, a model with normal
errors), we want to see how the MSE varies when $F$ is in some sense “near” $F_0$. To
this end, consider a neighborhood of size $\varepsilon$ of $F_0$, $\mathcal{F}(F_0, \varepsilon)$, as in (3.3). As a measure
of the “worst” that can happen we take the maximum MSE in the neighborhood, and
so define

$$\mathrm{MaxMSE}(\hat{\boldsymbol{\beta}}, F_0, \varepsilon) = \sup\{\mathrm{MSE}_F(\hat{\boldsymbol{\beta}}) : F \in \mathcal{F}(F_0, \varepsilon)\}. \tag{5.56}$$

Again, there is no analytical way to calculate MaxMSE, and therefore one must
resort to simulation. The simplest choice for the family of contaminating distributions
$G$ is the one of point-mass distributions $\delta_{(\mathbf{x}_0, y_0)}$. Since this is an infinite set, one lets
$(\mathbf{x}_0, y_0)$ range over a grid of values in order to approximate (5.56).


5.9.3.2 The DCML estimator 


To define the DCML estimator proposed by Maronna and Yohai (2014), we need 
an initial robust estimator, not necessarily with high finite-sample efficiency. Then 
the estimators are defined by maximizing the likelihood function subject to the esti- 
mator being sufficiently close to the initial one. Doing so, we can expect that the 
resulting estimator will have the maximum possible finite-sample efficiency under 
the assumed model compatible with proximity to the initial robust estimator. This 
proximity guarantees the robustness of the new estimator. 

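The formulation of this proposal is as follows. Let $D$ be a distance or discrepancy
measure between densities. As a general notation, given a family of distributions with
observation vector $\mathbf{z}$, parameter vector $\boldsymbol{\theta}$ and density $f(\mathbf{z}, \boldsymbol{\theta})$, put

$$d(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = D(f(\mathbf{z}, \boldsymbol{\theta}_1), f(\mathbf{z}, \boldsymbol{\theta}_2)).$$

Let $\mathbf{z}_i$, $i = 1, \ldots, n$, be i.i.d. observations with distribution $f(\mathbf{z}, \boldsymbol{\theta})$, and let $\hat{\boldsymbol{\theta}}_0$ be an initial
robust estimator. Call $L(\mathbf{z}_1, \ldots, \mathbf{z}_n; \boldsymbol{\theta})$ the likelihood function. Then the proposal is to
define an estimator $\hat{\boldsymbol{\theta}}$ as

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\mathbf{z}_1, \ldots, \mathbf{z}_n; \boldsymbol{\theta}) \quad \text{with } d(\hat{\boldsymbol{\theta}}_0, \boldsymbol{\theta}) \le \delta, \tag{5.57}$$

where $\delta$ is an adequately chosen constant that may depend on $n$. We shall call this
proposal the “distance-constrained maximum likelihood” (DCML).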

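Several dissimilarity measures, such as the Hellinger distance, may be employed
for this purpose. We shall employ as $D$ the Kullback-Leibler (KL) divergence,
because, as will be seen, this yields easily manageable results. The KL divergence
between densities $f_1$ and $f_2$ is defined as

$$D_{KL}(f_1, f_2) = \int_{-\infty}^{\infty} \log\left(\frac{f_1(\mathbf{z})}{f_2(\mathbf{z})}\right) f_1(\mathbf{z})\, d\mathbf{z}, \tag{5.58}$$

and therefore the $d$ in (5.57) will be

$$d_{KL}(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = \int_{-\infty}^{\infty} \log\left(\frac{f(\mathbf{z}, \boldsymbol{\theta}_1)}{f(\mathbf{z}, \boldsymbol{\theta}_2)}\right) f(\mathbf{z}, \boldsymbol{\theta}_1)\, d\mathbf{z}. \tag{5.59}$$

We now tailor our analysis to a linear model with random predictors. Consider
the family of distributions of $\mathbf{z} = (\mathbf{x}, y)$, where $\mathbf{x} \in R^p$ and $y \in R$, satisfying the model

$$y = \mathbf{x}'\boldsymbol{\beta} + \sigma u, \tag{5.60}$$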


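where $u \sim \mathrm{N}(0, 1)$ is independent of $\mathbf{x} \in R^p$. Here $\boldsymbol{\theta} = (\boldsymbol{\beta}, \sigma)$. Let $\hat{\boldsymbol{\theta}}_0 = (\hat{\boldsymbol{\beta}}_0, \hat{\sigma}_0)$ be
an initial robust estimator of regression and scale. We will actually consider $\sigma$ as a
nuisance parameter, and therefore we have

$$d_{KL}(\boldsymbol{\beta}_0, \boldsymbol{\beta}) = \frac{1}{\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\beta}_0)'\mathbf{V}(\boldsymbol{\beta} - \boldsymbol{\beta}_0) \tag{5.61}$$

with $\mathbf{V} = \mathrm{E}\mathbf{x}\mathbf{x}'$.

For $\boldsymbol{\beta} \in R^p$, the residuals from $\boldsymbol{\beta}$ are denoted $r_i(\boldsymbol{\beta}) = y_i - \mathbf{x}_i'\boldsymbol{\beta}$. Since $\sigma$ is
unknown, we replace it with its estimator $\hat{\sigma}_0$. The natural estimator of $\mathbf{V}$ would be
$\hat{\mathbf{V}} = n^{-1}\mathbf{X}'\mathbf{X}$, where $\mathbf{X}$ is the $n \times p$ matrix with rows $\mathbf{x}_i'$.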

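Put, for any positive semidefinite matrix $\mathbf{U}$,

$$d_{KL,\mathbf{U}}(\boldsymbol{\beta}_0, \boldsymbol{\beta}) = \frac{1}{\hat{\sigma}_0^2}(\boldsymbol{\beta} - \boldsymbol{\beta}_0)'\mathbf{U}(\boldsymbol{\beta} - \boldsymbol{\beta}_0). \tag{5.62}$$

It is immediately clear that (5.57) with $d = d_{KL,\hat{\mathbf{V}}}$ is equivalent to minimizing
$\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ subject to $d_{KL,\hat{\mathbf{V}}}(\hat{\boldsymbol{\beta}}_0, \boldsymbol{\beta}) \le \delta$. Call $\hat{\boldsymbol{\beta}}_{LS}$ the LSE. Put, for a general matrix $\mathbf{U}$:

$$\Delta_{\mathbf{U}} = d_{KL,\mathbf{U}}(\hat{\boldsymbol{\beta}}_0, \hat{\boldsymbol{\beta}}_{LS}).$$

Define $\hat{\boldsymbol{\beta}}$ as the minimizer of $\sum_{i=1}^n r_i(\boldsymbol{\beta})^2$ subject to $d_{KL,\hat{\mathbf{V}}}(\hat{\boldsymbol{\beta}}_0, \boldsymbol{\beta}) \le \delta$. In this case,
the solution is explicit:

$$\hat{\boldsymbol{\beta}} = t\hat{\boldsymbol{\beta}}_{LS} + (1 - t)\hat{\boldsymbol{\beta}}_0, \tag{5.63}$$

where $t = \min\left(1, \sqrt{\delta/\Delta_{\hat{\mathbf{V}}}}\right)$. We thus see that $\hat{\boldsymbol{\beta}}$ is a convex combination of $\hat{\boldsymbol{\beta}}_0$
and $\hat{\boldsymbol{\beta}}_{LS}$.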
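
The explicit solution (5.63) is simple enough to sketch directly. The Python fragment below is a minimal illustration under stated assumptions: the function name `dcml_combine` is hypothetical, the initial estimate, its scale, the LS estimate and the matrix $\mathbf{U}$ are taken as given inputs, and the distance is computed as in (5.62). It is not the code of any particular package.

```python
import numpy as np


def dcml_combine(beta0, sigma0, beta_ls, U, delta):
    """Convex combination (5.63) of an initial robust estimate and the LS estimate.

    beta0, sigma0 : initial robust regression and scale estimates (inputs, assumed given)
    beta_ls       : least-squares estimate
    U             : positive semidefinite matrix defining the distance (5.62)
    delta         : radius of the constraint in (5.57)
    """
    diff = beta_ls - beta0
    Delta = float(diff @ U @ diff) / sigma0**2            # distance between beta0 and LS
    t = 1.0 if Delta <= delta else float(np.sqrt(delta / Delta))
    return t * beta_ls + (1.0 - t) * beta0, t


# Toy numbers: the LS estimate is at distance 4 from the initial one and delta = 1,
# so t = sqrt(1/4) = 0.5 and the result lies halfway between the two.
beta, t = dcml_combine(np.array([0.0, 0.0]), 1.0, np.array([2.0, 0.0]), np.eye(2), delta=1.0)
print(beta, t)
```

When the LS estimate is already within distance $\delta$ of the initial estimate, $t = 1$ and the LS estimate is returned unchanged; otherwise the result sits on the boundary of the constraint region.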
Table 5.9 Finite-sample efficiencies of MM, Gervini-Yohai and DCML estimators

 p      n    MM-Bi85   G-Y    MM-Op99   DCML-Bi85   DCML-Op99
10     50     0.80     0.85    0.95       0.99        1.00
      100     0.82     0.89    0.97       0.99        1.00
      200     0.84     0.93    0.99       1.00        1.00
20    100     0.79     0.86    0.96       0.99        1.00
      200     0.82     0.92    0.97       0.99        1.00
      400     0.84     0.95    0.99       1.00        1.00
50    250     0.78     0.88    0.95       1.00        1.00
      500     0.82     0.94    0.98       1.00        1.00
     1000     0.84     0.97    0.99       1.00        1.00


Since $\hat{\mathbf{V}}$ is not robust, we replace it with the matrix $\tilde{\mathbf{V}}_{\mathbf{x}}$ defined in (5.3), and
therefore we choose

$$t = \min\left(1, \sqrt{\frac{\delta}{\Delta_{\tilde{\mathbf{V}}_{\mathbf{x}}}}}\right). \tag{5.64}$$

It is easy to show that if $\hat{\boldsymbol{\beta}}_0$ is regression- and affine-equivariant, so is $\hat{\boldsymbol{\beta}}$.

We have to choose $\delta$ in (5.57). We do so for the case when $\hat{\boldsymbol{\beta}}_0$ is the MM-estimator
with efficiency 0.85 and $\hat{\sigma}_0$ is the residual M-scale. Then $\delta$ is chosen as

$$\delta_{p,n} = 0.3\,\frac{p}{n}. \tag{5.65}$$

To justify (5.65) note that under the model, the distribution of $n\,d_{KL}(\hat{\boldsymbol{\beta}}_0, \hat{\boldsymbol{\beta}}_{LS})$ is
approximately that of $vz$, where $z \sim \chi^2_p$ and $v$ is some constant, which implies that
$\mathrm{E}\,d_{KL}(\hat{\boldsymbol{\beta}}_0, \hat{\boldsymbol{\beta}}_{LS}) \approx vp/n$. Therefore, in order to control the efficiency of $\hat{\boldsymbol{\beta}}$, it seems
reasonable to take $\delta$ of the form $Kp/n$ for some $K$. The value $K = 0.3$ was arrived
at after exploratory simulations aimed at striking a balance between efficiency
and robustness. The behavior of the estimator is not very sensitive to the choice
of the constant $K$; in fact, one may choose $K$ between, say, 0.25 and 0.35 without
serious effects.

It can be shown that for the estimators studied here, the finite-sample replacement
BP of the DCML estimator $\hat{\boldsymbol{\beta}}$ is at least that of the initial estimator $\hat{\boldsymbol{\beta}}_0$.

Table 5.9 shows the results of a simulation comparing the efficiencies of

• the MM-estimator with bisquare ρ and asymptotic efficiency 0.85 (MM-Bi85);
• the MM-estimator with optimal ρ (Section 5.8.1) and asymptotic efficiency 0.99
  (MM-Op99);
• the Gervini-Yohai (G-Y) estimator defined in Section 5.9.2 (with asymptotic
  efficiency 1);
• the DCML estimator with bisquare ρ starting from MM-Bi85 (DCML-Bi85);
• the DCML estimator with optimal ρ starting from MM-Op99 (DCML-Op99);

for a linear model with normal errors, p normal predictors, and an intercept. The
initial estimator employed for the MM and G-Y is described in Section 5.7.4.

It is seen that DCML-Op99 has the highest efficiency, followed in order by 
DCML-Bi85, MM-Op99, G-Y and MM-Bi85. 

It should be noted that while the asymptotic efficiency of DCML does not depend 
on the predictors’ distribution, the finite-sample efficiency does. For this reason 
Maronna and Yohai (2014) consider several other distributions for X with different 
degrees of heavy-tailedness, the results being similar to the ones shown here. 

We now deal with the estimators’ robustness. Table 5.10 gives the maximum 
mean squared errors (MSE) of the estimators, for 10% contaminated data with nor- 
mal predictors. Again, DCML-Op99 is the best in all cases, closely followed by 
DCML-Bi85 and MM-Op99, which have similar performances, and then G-Y and 
MM-Bi85. It may seem strange that the estimators with the highest efficiency also have the
lowest MSEs. The reason is that the MSE is composed of bias and variance; while, for
example, MM-Bi85 may have a smaller bias than MM-Op99, it has a larger variance.

In addition, MM-Ops with efficiencies lower than 0.99 have larger MSEs than 
MM-Op99, and for this reason their results were not shown. 

The DCML estimator uses as initial estimator an MM-estimator, which in turn is
based on an S-estimator. It would be natural to ask why we do not use the S-estimator
directly for DCML. The explanation is that since the S-estimator has a low efficiency,
in order for DCML to have high efficiency, the value of $\delta$ must be much higher than
(5.65), which greatly reduces the estimator’s robustness. This fact has been confirmed
by simulations.


Table 5.10 Maximum MSEs of MM, Gervini-Yohai and DCML estimators for
10% contamination

 p      n    MM-Bi85   G-Y    MM-Op99   DCML-Bi85   DCML-Op99
10     50     1.10     1.03    0.89       0.92        0.86
      100     0.55     0.48    0.46       0.45        0.44
      200     0.44     0.37    0.37       0.36        0.36
20    100     1.28     1.16    0.98       1.02        0.94
      200     0.58     0.49    0.47       0.47        0.45
      400     0.46     0.38    0.39       0.39        0.38
50    250     1.38     1.18    1.03       1.04        0.96
      500     0.62     0.51    0.49       0.49        0.47
     1000     0.52     0.42    0.41       0.43        0.40
5.9.3.3 Confidence intervals based on the DCML estimator


The asymptotic distribution of the DCML estimator, derived in Maronna and Yohai
(2014), is a very complicated non-normal distribution. For this reason, we propose
a simple heuristic method to approximate the estimator’s distribution with a normal
one. We take the value $t$ in (5.63) as if it were fixed. The asymptotic normal
distribution of the resulting estimator can be calculated. Although there is no theoretical
justification, simulations show that the coverage probabilities of the resulting
confidence intervals are a good approximation to the nominal one.

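The joint asymptotic distribution of $\hat{\boldsymbol{\beta}}_0$ and $\hat{\boldsymbol{\beta}}_{LS}$ can be derived from (3.49), and
from this a straightforward calculation shows that the asymptotic covariance matrix
of the estimator (5.63) under model (5.60) is

$$\mathbf{U} = \sigma^2\left(t^2 a_u + (1 - t)^2\frac{a}{b^2} + 2t(1 - t)\frac{c}{b}\right)\mathbf{C}^{-1},$$

where

$$a = \mathrm{E}\psi(u)^2, \quad b = \mathrm{E}\psi'(u), \quad c = \mathrm{E}\psi(u)u, \quad a_u = \mathrm{E}u^2, \quad \mathbf{C} = \mathrm{E}\mathbf{x}\mathbf{x}'.$$

Then $\mathbf{U}$ can be estimated by

$$\hat{\mathbf{U}} = \hat{\sigma}_0^2\left(t^2\hat{a}_u + (1 - t)^2\frac{\hat{a}}{\hat{b}^2} + 2t(1 - t)\frac{\hat{c}}{\hat{b}}\right)\hat{\mathbf{C}}^{-1},$$

where

$$\hat{a} = \frac{1}{n}\sum_{i=1}^n \psi\left(\frac{r_i(\hat{\boldsymbol{\beta}}_0)}{\hat{\sigma}_0}\right)^2, \quad \hat{b} = \frac{1}{n}\sum_{i=1}^n \psi'\left(\frac{r_i(\hat{\boldsymbol{\beta}}_0)}{\hat{\sigma}_0}\right),$$

$$\hat{c} = \frac{1}{n}\sum_{i=1}^n \psi\left(\frac{r_i(\hat{\boldsymbol{\beta}}_0)}{\hat{\sigma}_0}\right)\frac{r_i(\hat{\boldsymbol{\beta}}_0)}{\hat{\sigma}_0}, \quad \hat{a}_u = S\left(\frac{\mathbf{r}(\hat{\boldsymbol{\beta}}_0)}{\hat{\sigma}_0}\right)^2,$$

$\hat{\mathbf{C}}$ is an estimate of $\mathbf{C}$, and $S$ is an M-scale, normalized as shown at the end of Section 2.5. Actually, the
natural estimator of $a_u$ would be $n^{-1}\sum_{i=1}^n (r_i(\hat{\boldsymbol{\beta}}_0)/\hat{\sigma}_0)^2$, but then a large residual would
yield overly large intervals for heavy-tailed errors.

Given $\hat{\mathbf{U}}$, the asymptotic variance of a linear combination $\hat{\gamma} = \mathbf{a}'\hat{\boldsymbol{\beta}}$ can be
estimated by $n^{-1}\mathbf{a}'\hat{\mathbf{U}}\mathbf{a}$.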
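
The last step, turning an estimated covariance matrix into an approximate interval for a linear combination, might be sketched as follows. The function name `dcml_confint` is hypothetical, the estimated matrix is assumed to be supplied by the user, and a plain normal-approximation interval is assumed; this is an illustration, not the procedure of any specific package.

```python
import numpy as np
from statistics import NormalDist


def dcml_confint(beta_hat, U_hat, a, n, level=0.95):
    """Normal-approximation interval for a'beta with variance a' U_hat a / n."""
    a = np.asarray(a, dtype=float)
    est = float(a @ beta_hat)                      # point estimate of the combination
    se = float(np.sqrt(a @ U_hat @ a / n))         # estimated standard error
    z = NormalDist().inv_cdf(0.5 + level / 2.0)    # two-sided normal quantile
    return est - z * se, est + z * se


# Interval for the first coefficient with U_hat = I and n = 100 (standard error 0.1).
print(dcml_confint(np.array([1.0, 2.0]), np.eye(2), [1.0, 0.0], n=100))
```

Choosing `a` as a unit vector gives an interval for a single coefficient.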

Table 5.11 shows the results of a simulation for a model with ten normal
predictors plus intercept considering two error distributions: normal and Student with
three degrees of freedom. The values are the average covering probabilities (for all
11 parameters) of the approximate confidence interval for elements of $\boldsymbol{\beta}$ computed as
in (4.26), with nominal levels 0.90, 0.95 and 0.99. We see that the actual coverages
are reasonably close to the nominal ones.

Smucler and Yohai (2015) use a different approach to obtain regression estima- 
tors that have simultaneously high efficiency and robustness for finite samples. The 
performance of this procedure is comparable to that of the DCML estimator. 
Table 5.11 Actual coverage probabilities of DCML intervals for p = 10

                           Levels
Errors           n     0.90   0.95   0.99
Normal          50     0.92   0.96   0.99
               100     0.92   0.96   0.99
Student 3-DF    50     0.88   0.93   0.98
               100     0.86   0.92   0.98


5.9.4 Choosing a regression estimator 


In Sections 5.4, 5.5 and 5.9 we considered several families of estimators. Estimators
based on a residual scale are robust, but have low efficiency. MM-estimators can give
(at least approximately) a given finite-sample efficiency without losing robustness.
Both versions of DCML have simultaneously a higher efficiency and a lower MSE
than the corresponding MM-estimators, as seen in Tables 5.9 and 5.10. Also, the optimal ρ
outperforms the bisquare ρ. Overall, DCML-Op99 appears the best in both efficiency and
robustness, closely followed by DCML-Bi85.

As to the initial estimators, Tables 5.5 and 5.6 show that the Peña-Yohai estimator
outperforms subsampling in speed, efficiency and robustness. For these reasons we
recommend the following procedure:

1. Compute the Peña-Yohai estimator.
2. Use it as starting estimator for an MM-estimator with optimal ρ and 99% efficiency.
3. Use this MM as initial estimator for the DCML estimator with optimal ρ.


5.10 Robust regularized regression 


In this section we deal with robust versions of regression estimators that have better 
predictive accuracy than ordinary least squares (OLS) in situations where the predic- 
tors are highly correlated and/or the number of regressors p is large compared to the 
number of cases n, and even when p > n. The situation of prediction with p >> n has 
become quite common, especially in chemometrics. One instance of such a situation 
appears when attempting to replace the laboratory determination of the amount of a 
given chemical compound in a sample of material, with methods based on cheaper 
and faster spectrographic measurements. These estimators work by putting a “penalty”
on the coefficients, which produces a bias but at the same time decreases the
variances of the estimators. These estimators are known as regularized regression
estimators.

Consider a regression model with intercept (4.5) with data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$,
where we put $\mathbf{x}_i = (1, \underline{\mathbf{x}}_i')'$ with $\underline{\mathbf{x}}_i \in R^{p-1}$ and $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1')'$, and call $\mathbf{X}$ the $n \times (p - 1)$
matrix with $i$th row equal to $\underline{\mathbf{x}}_i'$. Then we can define a general class of regularized
regression estimators by

$$\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1')' = \arg\min_{\beta_0 \in R,\ \boldsymbol{\beta}_1 \in R^{p-1}} L(\boldsymbol{\beta}), \tag{5.66}$$

where $L$ is of the form

$$L(\boldsymbol{\beta}) = s^2\sum_{i=1}^n \rho\left(\frac{r_i(\boldsymbol{\beta})}{s}\right) + \lambda\sum_{j=1}^{p-1} h(\beta_j), \tag{5.67}$$

where $r_i(\boldsymbol{\beta}) = y_i - \boldsymbol{\beta}'\mathbf{x}_i$, $\rho$ is a loss function, $s$ is a residual scale and $h$ is a
monotone function penalizing large coefficients. The parameter $\lambda$ determines the severity
of the penalization and is usually determined by $K$-fold cross validation, described
as follows. A set $\Lambda$ of candidate $\lambda$’s is chosen; the data are split into $K$ subsets of
approximately equal size; for each $\lambda \in \Lambda$, $K - 1$ of the subsets are used in turn to
compute the estimator, and the remaining one to compute the prediction errors. The
set of $n$ prediction errors for this $\lambda$ is summarized by the MSE or a robust version
thereof, and the $\lambda$ yielding the smallest MSE is chosen.
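
The cross-validation scheme just described can be sketched in Python as follows. The sketch is illustrative only: the names `ridge_fit` and `kfold_cv_lambda` are hypothetical, the fitted estimator is ordinary (non-robust) ridge regression with an unpenalized intercept, chosen here only because it has a closed form, and the prediction errors are summarized by the plain MSE. A robust estimator or a robust summary of the errors could be substituted through the `fit` argument or the last line of the loop.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Classical ridge fit with unpenalized intercept; returns (intercept, slopes)."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc = X - xbar                                           # centering removes the intercept
    b1 = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ybar))
    return ybar - xbar @ b1, b1


def kfold_cv_lambda(X, y, lambdas, K=5, seed=0, fit=ridge_fit):
    """Choose lambda from a candidate set by K-fold cross validation (MSE criterion)."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    scores = []
    for lam in lambdas:
        errs = np.empty(n)
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)        # the other K - 1 subsets
            b0, b1 = fit(X[train], y[train], lam)
            errs[test] = y[test] - (b0 + X[test] @ b1)      # prediction errors on held-out fold
        scores.append(float(np.mean(errs**2)))              # summary of the n prediction errors
    return lambdas[int(np.argmin(scores))], scores


rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5])                    # noiseless toy data
best, scores = kfold_cv_lambda(X, y, [0.0, 1.0, 10.0])
print(best, scores)
```

With noiseless data and more cases than predictors, the unpenalized fit predicts exactly, so the smallest candidate is selected in this toy example.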

When robustness is not an issue, $\rho$ is usually chosen as a quadratic function:
$\rho(r) = r^2$, and in this case $L(\boldsymbol{\beta})$ does not depend on $s$.

For an overview of this topic, see the book by Bühlmann and van de Geer (2011).


5.10.1 Ridge regression 


The simplest and oldest of regularized estimators is the so-called ridge regression
(henceforth RR), first proposed by Hoerl and Kennard (1970). The RR estimator is
given by (5.66) and (5.67), with $\rho(r) = r^2$, $h(\beta) = \beta^2$ and $s = 1$; that is,

$$\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1')' = \arg\min_{\beta_0 \in R,\ \boldsymbol{\beta}_1 \in R^{p-1}} L(\boldsymbol{\beta}) \tag{5.68}$$

with

$$L(\boldsymbol{\beta}) = \sum_{i=1}^n r_i^2(\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p-1}\beta_j^2. \tag{5.69}$$

Ridge regression is particularly useful when it can be assumed that all coefficients
have approximately the same order of magnitude (unlike “sparse” situations, when
most coefficients are assumed to be null); see the comments on RR by Zou and Hastie
(2005) and Frank and Friedman (1993).

Since RR is based on least squares, it is sensitive to atypical observations.
Several approaches have been proposed to make RR robust towards outliers,
the most interesting ones being the ones by Silvapulle (1991) and Simpson and
Montgomery (1996). These estimators, however, present two drawbacks. The first
is that they are not sufficiently robust towards “bad leverage points”, especially
for large p. The second is that they require an initial (standard) robust regression
estimator. When p > n this becomes impossible, and when p < n but the ratio p/n is
“large”, the initial estimator has a low robustness, as measured by the BP.

In the next subsection we present a robust version of the RR estimator proposed 
by Maronna (2011). This can also be used when p > n, is robust when p/n is large, 
and is resistant to bad leverage points. 


5.10.1.1 MM-estimation for ridge regression 


To ensure both robustness and efficiency under the normal model, we employ the
approach of MM-estimation (Yohai 1987). Start with an initial robust but possibly
inefficient estimator $\hat{\boldsymbol{\beta}}_{ini}$; from the respective residuals compute a robust scale
estimator $\hat{\sigma}_{ini}$. Then compute an M-estimator with fixed scale $\hat{\sigma}_{ini}$, starting the iterations
from $\hat{\boldsymbol{\beta}}_{ini}$ and using a loss function that ensures the desired efficiency. Here
“efficiency” will be loosely defined as “similarity with the classical RR estimator for the
normal model”.

Let $\hat{\boldsymbol{\beta}}_{ini}$ be an initial estimator and let $\hat{\sigma}_{ini}$ be an M-scale of $\mathbf{r}(\hat{\boldsymbol{\beta}}_{ini})$:

$$\frac{1}{n}\sum_{i=1}^n \rho_0\left(\frac{r_i(\hat{\boldsymbol{\beta}}_{ini})}{\hat{\sigma}_{ini}}\right) = \delta, \tag{5.70}$$

where $\rho_0$ is a bounded $\rho$-function and $\delta$ is to be chosen. Then the MM-estimator for
RR (henceforth RR-MM) is defined by (5.68) with

$$L(\boldsymbol{\beta}) = \hat{\sigma}_{ini}^2\sum_{i=1}^n \rho\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right) + \lambda\|\boldsymbol{\beta}_1\|^2, \tag{5.71}$$

where $\rho$ is another bounded $\rho$-function such that $\rho \le \rho_0$. The factor $\hat{\sigma}_{ini}^2$ before the
summation is employed to make the estimator coincide with classical RR when
$\rho(t) = t^2$.

We shall henceforth use

$$\rho_0(t) = \rho_{bis}\left(\frac{t}{c_0}\right), \quad \rho(t) = \rho_{bis}\left(\frac{t}{c}\right), \tag{5.72}$$

where $\rho_{bis}$ denotes the bisquare $\rho$-function (2.38) and the constants $c_0 < c$ are chosen
to control both robustness and efficiency.
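
An M-scale equation such as (5.70) has no closed-form solution, but it can be solved by a simple fixed-point iteration. The Python sketch below illustrates this under assumptions that are not taken from the text: the function names are hypothetical, the bisquare function is normalized so that its maximum is 1, and the defaults `c0 = 1.547` and `delta = 0.5` are the usual choices giving a 50% breakdown point with consistency at the normal model. The starting value (a normalized median of absolute residuals) is also just a convenient choice.

```python
import numpy as np


def rho_bisquare(t):
    """Bisquare rho-function, normalized so that rho(t) = 1 for |t| >= 1."""
    u = np.clip(np.abs(t), 0.0, 1.0)
    return 1.0 - (1.0 - u**2) ** 3


def m_scale(r, c0=1.547, delta=0.5, tol=1e-10, max_iter=1000):
    """Solve mean(rho_bisquare(r / (c0 * s))) = delta for s by fixed-point iteration."""
    r = np.asarray(r, dtype=float)
    s = np.median(np.abs(r)) / 0.6745                   # starting value (assumption)
    if s == 0.0:
        return 0.0
    for _ in range(max_iter):
        # s^2 <- s^2 * mean(rho(r / (c0 s))) / delta leaves the solution fixed
        s_new = s * np.sqrt(np.mean(rho_bisquare(r / (c0 * s))) / delta)
        if abs(s_new - s) <= tol * s:
            return float(s_new)
        s = s_new
    return float(s)


r = np.concatenate([np.linspace(-2.0, 2.0, 41), [50.0]])   # residuals with one outlier
s = m_scale(r)
print(s, np.mean(rho_bisquare(r / (1.547 * s))))           # the second value should be close to 0.5
```

Because the bisquare function is bounded, the single large residual contributes at most 1 to the average and has limited effect on the resulting scale.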

Call $\varepsilon^*$ the breakdown point (BDP) of $\hat{\boldsymbol{\beta}}_{ini}$. It is not difficult to show that
if $\lambda > 0$, the BDP of RR-MM is $\ge \varepsilon^*$. However, since the estimator is not
regression-equivariant, this result is deceptive. As an extreme case, take $p = n - 1$
and $\varepsilon^* = 0.5$. Then the BDP of $\hat{\boldsymbol{\beta}}$ is null for $\lambda = 0$, but is 0.5 for $\lambda = 10^{-6}$! Actually,
for a given contamination rate less than 0.5, the maximum bias of the estimator
remains bounded, but the bound tends to infinity when $\lambda$ tends to zero.

The classical ridge regression estimator given by (5.68) and (5.69) will henceforth
be denoted as RR-LS for brevity.


5.10.1.2 The iterative algorithm 


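Note that the loss function (5.69) is the same as the OLS loss for the augmented
dataset $(\tilde{\mathbf{X}}, \tilde{\mathbf{y}})$ with

$$\tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{1}_n & \mathbf{X} \\ \mathbf{0}_{p-1} & \lambda^{1/2}\mathbf{I}_{p-1} \end{bmatrix}, \quad \tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_{p-1} \end{bmatrix}, \tag{5.73}$$

where, as a general notation, $\mathbf{I}_p$, $\mathbf{1}_p$ and $\mathbf{0}_p$ are respectively the $p$-dimensional identity
matrix and the $p$-dimensional vectors of ones and of zeroes. Then the classical
estimator RR-LS satisfies the “normal equations” for the augmented sample. This may
be written

$$\hat{\beta}_0 = \bar{y} - \bar{\mathbf{x}}'\hat{\boldsymbol{\beta}}_1, \quad (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I}_{p-1})\hat{\boldsymbol{\beta}}_1 = \mathbf{X}'(\mathbf{y} - \hat{\beta}_0\mathbf{1}_n), \tag{5.74}$$

where

$$\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \underline{\mathbf{x}}_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.$$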
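
The equivalence between the augmented least-squares problem (5.73) and the normal equations (5.74) can be checked numerically. The Python sketch below is illustrative: the function names are hypothetical, `X` is assumed to hold the predictors without the column of ones, and the second function eliminates the intercept by centering before solving for the slopes.

```python
import numpy as np


def ridge_augmented(X, y, lam):
    """Classical ridge via ordinary LS on the augmented dataset of (5.73)."""
    n, m = X.shape
    Xa = np.block([[np.ones((n, 1)), X],
                   [np.zeros((m, 1)), np.sqrt(lam) * np.eye(m)]])
    ya = np.concatenate([y, np.zeros(m)])
    coef = np.linalg.lstsq(Xa, ya, rcond=None)[0]
    return coef[0], coef[1:]                       # intercept, slopes


def ridge_normal_equations(X, y, lam):
    """Classical ridge from the normal equations, with the intercept eliminated."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc = X - xbar
    b1 = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ybar))
    return ybar - xbar @ b1, b1


rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = X[:, 0] + rng.standard_normal(30)
print(ridge_augmented(X, y, 2.5))
print(ridge_normal_equations(X, y, 2.5))           # should agree with the line above
```

The augmented form is convenient because any least-squares routine, or any procedure designed for ordinary regression data, can be applied to it unchanged.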


A similar system of equations is satisfied by RR-MM. Define, as in (2.31), $\psi(t) =
\rho'(t)$ and $W(t) = \psi(t)/t$. Let

$$t_i = \frac{r_i}{\hat{\sigma}_{ini}}, \quad w_i = \frac{W(t_i)}{2}, \quad \mathbf{w} = (w_1, \ldots, w_n)', \quad \mathbf{W} = \mathrm{diag}(\mathbf{w}). \tag{5.75}$$

Setting the derivatives of (5.71) with respect to $\boldsymbol{\beta}$ to zero yields for RR-MM

$$\mathbf{w}'(\mathbf{y} - \hat{\beta}_0\mathbf{1}_n - \mathbf{X}\hat{\boldsymbol{\beta}}_1) = 0 \tag{5.76}$$

and

$$(\mathbf{X}'\mathbf{W}\mathbf{X} + \lambda\mathbf{I}_{p-1})\hat{\boldsymbol{\beta}}_1 = \mathbf{X}'\mathbf{W}(\mathbf{y} - \hat{\beta}_0\mathbf{1}_n). \tag{5.77}$$

Therefore, RR-MM satisfies a weighted version of (5.74). Since for the chosen
$\rho$, $W(t)$ is a decreasing function of $|t|$, observations with larger residuals will receive
lower weights $w_i$.

As is usual in robust statistics, these “weighted normal equations” suggest an
iterative procedure. Starting with an initial $\hat{\boldsymbol{\beta}}$:

• Compute the residual vector $\mathbf{r}$ and the weights $\mathbf{w}$.
• Leaving $\hat{\beta}_0$ and $\mathbf{w}$ fixed, compute $\hat{\boldsymbol{\beta}}_1$ from (5.77).
• Recompute $\mathbf{w}$, and compute $\hat{\beta}_0$ from (5.76).
• Repeat until the change in the residuals is small enough.

Recall that $\hat{\sigma}_{ini}$ remains fixed throughout. This procedure may be called “iterative
reweighted RR”. It can be shown that the objective function (5.71) descends at
each iteration.
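
The iteration just described can be sketched in Python as follows. This is a minimal illustration, not the algorithm of Maronna (2011) in full: the function names are hypothetical, the bisquare function is normalized to have maximum 1, the default tuning constant `c = 3.44` is an assumed value, and the initial estimate and the fixed scale are taken as inputs rather than computed by a robust initial procedure.

```python
import numpy as np


def rho_bis(t, c):
    """Bisquare rho evaluated at t/c; equals 1 for |t| >= c."""
    u = np.clip(np.abs(t) / c, 0.0, 1.0)
    return 1.0 - (1.0 - u**2) ** 3


def weight_bis(t, c):
    """W(t) = psi(t)/t for rho(t) = rho_bis(t/c)."""
    u = np.clip(np.abs(t) / c, 0.0, 1.0)
    return 6.0 * (1.0 - u**2) ** 2 / c**2


def rr_mm_objective(X, y, b0, b1, sigma, lam, c):
    """Penalized robust loss: sigma^2 * sum rho(r_i/sigma) + lam * ||b1||^2."""
    r = y - b0 - X @ b1
    return float(sigma**2 * np.sum(rho_bis(r / sigma, c)) + lam * (b1 @ b1))


def rr_mm_irls(X, y, b0, b1, sigma, lam, c=3.44, tol=1e-8, max_iter=500):
    """Iterative reweighted ridge regression with fixed scale sigma."""
    m = X.shape[1]
    b0, b1 = float(b0), np.array(b1, dtype=float)
    history = [rr_mm_objective(X, y, b0, b1, sigma, lam, c)]
    r = y - b0 - X @ b1
    for _ in range(max_iter):
        w = weight_bis(r / sigma, c) / 2.0                   # weights from current residuals
        XtW = X.T * w
        b1 = np.linalg.solve(XtW @ X + lam * np.eye(m), XtW @ (y - b0))  # slopes, intercept fixed
        w = weight_bis((y - b0 - X @ b1) / sigma, c) / 2.0   # recompute the weights
        if w.sum() > 0:
            b0 = float(w @ (y - X @ b1) / w.sum())           # weighted-mean update of intercept
        r_new = y - b0 - X @ b1
        history.append(rr_mm_objective(X, y, b0, b1, sigma, lam, c))
        done = np.max(np.abs(r_new - r)) <= tol * sigma      # change in residuals small enough
        r = r_new
        if done:
            break
    return b0, b1, history
```

The returned `history` holds the value of the penalized loss after each pass, which gives a simple way to monitor that it does not increase along the iterations.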
5.10.1.3 The initial estimator for RR-MM


As in the standard case, the initial estimator for RR-MM will be the one based on the
minimization of a (penalized) robust scale (an S-estimator). We define an S-estimator
for RR (henceforth RR-SE) by replacing in (5.67) the sum of squared residuals by a
squared robust M-scale $\hat{\sigma}$ (2.49), namely

$$L(\boldsymbol{\beta}) = n\hat{\sigma}(\mathbf{r}(\boldsymbol{\beta}))^2 + \lambda\|\boldsymbol{\beta}_1\|^2, \tag{5.78}$$

where $\mathbf{r}(\boldsymbol{\beta}) = (r_1(\boldsymbol{\beta}), \ldots, r_n(\boldsymbol{\beta}))$. Here the factor $n$ is used to make (5.78)
coincide with (5.67) when $\hat{\sigma}^2(\mathbf{r}) = \mathrm{ave}_i(r_i^2)$, corresponding to $\rho(t) = t^2$ and $\delta = 1$
in (2.49).

A straightforward calculation shows that the estimator satisfies the “normal
equations” (5.76)-(5.77), with $w_i = W(t_i)$ and $\lambda$ replaced by $\lambda' = \lambda n^{-1}\sum_{i=1}^n w_i t_i^2$.
These equations suggest an iterative algorithm similar to the one at the end of
Section 5.10.1.2, with the difference that now the scale changes at each iteration; that
is, now in the definition of $t_i$ in (5.75), $\hat{\sigma}_{ini}$ is replaced by $\hat{\sigma}(\mathbf{r}(\hat{\boldsymbol{\beta}}))$. Although it has not
been possible to prove that the objective function descends at each iteration, this fact
has been verified in all cases up to now. The initial estimator for RR-SE employs
the procedure proposed by Peña and Yohai (1999) and described in Section 5.7.4,
but applied to the augmented sample (5.73).


5.10.1.4 Choosing λ through cross-validation


In order to choose $\lambda$ we must estimate the prediction error of the regularized
estimator for different values of $\lambda$. Call CV(K) the K-fold cross validation process, which
requires recomputing the estimator K times. For K = n (“leave one out”) we can use
an approximation to avoid recomputing. Call $\hat{y}_{-i}$ the fit of $y_i$ computed without using
the $i$th observation; that is, $\hat{y}_{-i} = \hat{\beta}_0^{(-i)} + \underline{\mathbf{x}}_i'\hat{\boldsymbol{\beta}}_1^{(-i)}$, where $(\hat{\beta}_0^{(-i)}, \hat{\boldsymbol{\beta}}_1^{(-i)\prime})'$ is the
estimator computed without observation $i$. Then a first-order Taylor approximation of the
estimator yields the approximate prediction errors.

For this method to work properly, several technical details must be taken into 
account. They are omitted here for brevity, and are explained in the paper by Maronna 
(2011). An application of robust RR to functional regression is given by Maronna and 
Yohai (2013). 


5.10.2 Lasso regression 


One of the shortcomings of RR is that it cannot be employed in “sparse” situations, 
where it is assumed that a large part of the coefficients are null, since, although RR 
shrinks the coefficients toward zero, it does not produce exact zeros, and is thus not 
useful to select predictors. 
To remedy this inconvenience, Tibshirani (1996) proposed a new choice for the
penalization function $h$ in (5.67). He introduced the lasso (least absolute shrinkage
and selection operator) regression estimator, which is defined by minimizing

$$L(\boldsymbol{\beta}) = \sum_{i=1}^n r_i^2(\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p-1}|\beta_j|. \tag{5.79}$$

In other words, $L(\boldsymbol{\beta})$ is of the form (5.67) with $\rho(r) = r^2$ and $h(\beta) = |\beta|$. It can be
shown that using lasso regression, the number of null coefficients increases when $\lambda$
increases.

Zou (2006) proposed a two-step modification of the lasso procedure called the
adaptive lasso. In the first step, an estimator $\hat{\boldsymbol{\beta}}^{(0)} = (\hat{\beta}_0^{(0)}, \hat{\beta}_1^{(0)}, \ldots, \hat{\beta}_{p-1}^{(0)})$ (for example,
a lasso estimator) is computed, and in the second step the adaptive lasso estimator is
obtained as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}\ \sum_{i=1}^n r_i^2(\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p-1}\frac{|\beta_j|}{|\hat{\beta}_j^{(0)}|} \tag{5.80}$$

(with the convention that 0/0 = 0). Thus the coefficients that were small in the first
step receive a higher penalty in the second. It also follows from (5.80) that the
coefficients that are null in the first step are also null in the second.

There are two main algorithms to compute the lasso: least angle regression (Efron
et al., 2004) and the coordinate descent algorithm (Friedman et al., 2007). The first
one is computationally similar to stepwise variable selection, while the second takes
one coefficient $\beta_j$ at a time to minimize the loss function.
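
To make the coordinate descent idea concrete, here is a minimal Python sketch for the (non-robust) lasso loss, that is, the sum of squared residuals plus $\lambda$ times the sum of absolute slopes, with an unpenalized intercept. It is an illustration under stated assumptions rather than the algorithm of any particular package: the function names are hypothetical, `X` is assumed to hold the predictors without the column of ones, and no standardization of predictors or active-set tricks are used. Each coordinate update has a closed form given by soft-thresholding.

```python
import numpy as np


def soft_threshold(z, g):
    """Soft-thresholding operator: sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * max(abs(z) - g, 0.0)


def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-10):
    """Coordinate descent for sum(r_i^2) + lam * sum(|b_j|), intercept unpenalized."""
    n, m = X.shape
    b0, b = float(y.mean()), np.zeros(m)
    col_ss = np.sum(X**2, axis=0)                  # sum_i x_ij^2 for each column
    r = y - b0                                     # current residuals (b = 0 initially)
    for _ in range(n_sweeps):
        b_old = b.copy()
        shift = r.mean()                           # intercept update: recentre residuals
        b0 += shift
        r -= shift
        for j in range(m):
            if col_ss[j] == 0.0:
                continue
            zj = X[:, j] @ r + col_ss[j] * b[j]    # correlation with partial residual
            bj = soft_threshold(zj, lam / 2.0) / col_ss[j]
            r += X[:, j] * (b[j] - bj)             # keep residuals in sync
            b[j] = bj
        if np.max(np.abs(b - b_old)) <= tol:
            break
    return b0, b


rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
y = 1.0 + 3.0 * X[:, 0] + 0.5 * rng.standard_normal(60)
for lam in (0.0, 20.0, 1000.0):
    print(lam, lasso_cd(X, y, lam))
```

With $\lambda = 0$ the result coincides with ordinary least squares, and for $\lambda$ large enough every slope is set exactly to zero.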


5.10.2.1 Robust lasso 


Since lasso regression uses a quadratic loss function, this estimator is very sensitive to
outliers. There are several proposals in which the quadratic loss function is replaced
by a convex $\rho$-function, such as Huber’s $\rho$-function (2.28) (Li et al., 2011) or $\rho(r) =
|r|$ (Wang et al., 2007). However, since these $\rho$-functions are not bounded, these
estimators are not robust. Alfons et al. (2013) present a lasso version of the least
trimmed squares (LTS) estimator that was defined in Section 5.4.2. This estimator is
robust, but very inefficient under Gaussian errors. Khan et al. (2007) propose a robust
version of least angle regression, but they employ it as a tool for model selection rather
than as an estimator.

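Smucler and Yohai (2017) propose the MM lasso estimator, which is obtained by
minimizing

$$L(\boldsymbol{\beta}) = \sum_{i=1}^n \rho\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right) + \lambda\sum_{j=1}^{p-1}|\beta_j|,$$

where $\rho$ is a bounded $\rho$-function and $\hat{\sigma}_{ini}$ is an initial scale, as in the case of RR-MM.

Differentiating with respect to $\boldsymbol{\beta}$ yields the estimating equations

$$-\frac{1}{\hat{\sigma}_{ini}}\sum_{i=1}^n \psi\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right)x_{ij} + \lambda\,\mathrm{sign}(\beta_j) \simeq 0, \quad 1 \le j \le p - 1,$$

$$\sum_{i=1}^n \psi\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right) = 0,$$

where “$\simeq$” means that the left-hand side changes sign when $\beta_j$ crosses zero. This is
equivalent to the system of equations

$$-2\sum_{i=1}^n r_i^*(\boldsymbol{\beta})x_{ij}^* + \lambda\hat{\sigma}_{ini}^2\,\mathrm{sign}(\beta_j) \simeq 0, \quad 1 \le j \le p - 1,$$

$$\sum_{i=1}^n r_i^*(\boldsymbol{\beta})x_{i0}^* = 0, \tag{5.81}$$

where

$$r_i^*(\boldsymbol{\beta}) = y_i^* - \boldsymbol{\beta}'\mathbf{x}_i^*, \quad y_i^* = \sqrt{w_i}\,y_i, \quad \mathbf{x}_i^* = \sqrt{w_i}\,\mathbf{x}_i,$$

$x_{i0}^* = \sqrt{w_i}$ is the first component of $\mathbf{x}_i^*$, and $w_i$ is as given in (5.75). This system of
equations is very similar to that of lasso regression for the sample $(\mathbf{x}_1^*, y_1^*), \ldots, (\mathbf{x}_n^*, y_n^*)$
and penalization parameter $\lambda\hat{\sigma}_{ini}^2$. Then, to compute the MM lasso, one can use the
following iterative algorithm:

• Compute the initial estimator $\hat{\boldsymbol{\beta}}^{(0)}$ and $\hat{\sigma}_{ini}$ as in the case of RR-MM by the
  Peña-Yohai procedure.
• Once $\hat{\boldsymbol{\beta}}^{(k)}$ is computed, compute the weights $w_i$ as in (5.75), using $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}^{(k)}$;
  compute the transformed sample $(\mathbf{x}_i^*, y_i^*)$, $i = 1, \ldots, n$, and obtain $\hat{\boldsymbol{\beta}}^{(k+1)}$ by solving
  the lasso problem for this sample with penalization parameter $\lambda\hat{\sigma}_{ini}^2$.
• Stop when $\|\hat{\boldsymbol{\beta}}^{(k+1)} - \hat{\boldsymbol{\beta}}^{(k)}\|/\|\hat{\boldsymbol{\beta}}^{(k)}\| \le \varepsilon$.

More details of the algorithm can be found in Smucler and Yohai (2017).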
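
The reweighting scheme above can be sketched in Python. The sketch makes several assumptions that should not be read as the authors' algorithm: the function names are hypothetical, the loss is taken to be the sum of $\rho(r_i/\hat{\sigma}_{ini})$ plus $\lambda$ times the sum of absolute slopes, with a bisquare $\rho$ normalized to have maximum 1 and an assumed tuning constant `c = 3.44`, the initial estimate and scale are taken as inputs, and the inner lasso problem is only improved by a fixed number of weighted coordinate descent sweeps (which is equivalent to working with the sample transformed by the square roots of the weights) with penalty $\lambda\hat{\sigma}_{ini}^2$.

```python
import numpy as np


def rho_bis(t, c):
    u = np.clip(np.abs(t) / c, 0.0, 1.0)
    return 1.0 - (1.0 - u**2) ** 3                  # bisquare rho at t/c, maximum 1


def weight_bis(t, c):
    u = np.clip(np.abs(t) / c, 0.0, 1.0)
    return 6.0 * (1.0 - u**2) ** 2 / c**2           # W(t) = psi(t)/t


def mm_lasso_objective(X, y, b0, b, sigma, lam, c):
    r = y - b0 - X @ b
    return float(np.sum(rho_bis(r / sigma, c)) + lam * np.sum(np.abs(b)))


def mm_lasso(X, y, b0, b, sigma, lam, c=3.44, eps=1e-8, n_outer=200, n_inner=20):
    """Iteratively reweighted lasso with fixed scale sigma (illustrative sketch)."""
    b0, b = float(b0), np.array(b, dtype=float)
    lam_star = lam * sigma**2                       # penalty of the reweighted lasso problem
    history = [mm_lasso_objective(X, y, b0, b, sigma, lam, c)]
    for _ in range(n_outer):
        b_old = b.copy()
        r = y - b0 - X @ b
        w = weight_bis(r / sigma, c) / 2.0          # weights from the current residuals
        if w.sum() == 0.0:
            break                                   # all points rejected: give up (assumption)
        den = (X**2).T @ w                          # sum_i w_i x_ij^2
        for _ in range(n_inner):                    # weighted lasso sweeps, warm start
            shift = float(w @ r / w.sum())          # weighted intercept update
            b0 += shift
            r -= shift
            for j in range(X.shape[1]):
                if den[j] == 0.0:
                    continue
                zj = float((w * X[:, j]) @ r + den[j] * b[j])
                bj = np.sign(zj) * max(abs(zj) - lam_star / 2.0, 0.0) / den[j]
                r += X[:, j] * (b[j] - bj)
                b[j] = bj
        history.append(mm_lasso_objective(X, y, b0, b, sigma, lam, c))
        if np.linalg.norm(b - b_old) <= eps * max(np.linalg.norm(b_old), eps):
            break                                   # relative change in the slopes is small
    return b0, b, history
```

Because each inner sweep starts from the current estimate and can only decrease the weighted surrogate, the values stored in `history` should not increase, which offers a simple sanity check when experimenting with the sketch.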
An adaptive MM lasso estimator is defined by minimizing the loss function

$$L(\boldsymbol{\beta}) = \sum_{i=1}^n \rho\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right) + \lambda\sum_{j=1}^{p-1}\frac{|\beta_j|}{|\hat{\beta}_j^{(0)}|},$$

where $\hat{\boldsymbol{\beta}}^{(0)}$ is a first-step estimator, as in (5.80).
This estimator can be computed using an approach similar to the one used for the
MM lasso, using a simple transformation of the $x_{ij}$’s (see Problem 5.11).

The penalization parameter $\lambda$ may be chosen by cross validation. For this purpose,
a grid of equally spaced candidates is chosen in the interval $[0, \lambda_{\max}]$, where $\lambda_{\max}$ is a
value large enough so that the lasso estimator has all the components equal to 0. For
each candidate, a robust scale of the residuals is computed, and the candidate with
minimum scale is chosen. We can, for example, use an M-scale or a $\tau$-scale. More
details on the algorithm and the cross validation procedure can be found in the paper
by Smucler and Yohai (2017).


5.10.2.2 Oracle properties 


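Consider a sparse linear regression model with $\boldsymbol{\beta}_1 = (\boldsymbol{\beta}_{11}', \boldsymbol{\beta}_{12}')'$, where $\boldsymbol{\beta}_{11} \in R^q$ and
$\boldsymbol{\beta}_{12} = \mathbf{0}_m$, with $m = p - q - 1$. It is said that an estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ has an oracle-type
property if it has behavior similar to that of the estimator computed assuming that
$\boldsymbol{\beta}_{12} = \mathbf{0}$; that is, when the true submodel is known. This property can be
formalized in different ways: some approaches look at the behavior of the estimator itself,
and others at the prediction accuracy that can be achieved with the estimator. We will
use here the characterization of Fan and Li (2001). According to their definition, a
regularized estimator $\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\boldsymbol{\beta}}_{11}', \hat{\boldsymbol{\beta}}_{12}')'$ of $\boldsymbol{\beta}$ has the oracle property if it satisfies:

• $\lim_{n\to\infty} P(\hat{\boldsymbol{\beta}}_{12} \ne \mathbf{0}) = 0$;
• $n^{1/2}\left((\hat{\beta}_0, \hat{\boldsymbol{\beta}}_{11}')' - (\beta_0, \boldsymbol{\beta}_{11}')'\right) \to_d \mathrm{N}_{q+1}(\mathbf{0}, \boldsymbol{\Sigma}^*)$,

where $\boldsymbol{\Sigma}^*$ is the asymptotic covariance of the non-penalized estimator assuming that
$\boldsymbol{\beta}_{12} = \mathbf{0}$.

It was proved by Smucler and Yohai (2017) that if $\lambda$ is chosen depending on $n$ so
that $n^{1/2}\lambda \to 0$, then, under very general conditions, these properties are satisfied by
the adaptive MM lasso. In this case, the asymptotic distribution of
$n^{1/2}\left((\hat{\beta}_0, \hat{\boldsymbol{\beta}}_{11}')' - (\beta_0, \boldsymbol{\beta}_{11}')'\right)$ is the
same as if the MM-estimator without penalization were applied only to the first $q$
predictors.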


5.10.3. Other regularized estimators 


It can be proved that the lasso estimator has at most min(7, p) nonnull components. 
When pis much larger than n this may be a limitation for selecting all the relevant 
variables. To overcome this limitation, Zou and Hastie (2005) proposed the so-called 
elastic net estimator, which is defined by minimizing 


n p-l p-1 


L(B) = Y)77(B) + Ay DY IB + 22D) BP. (5.82) 
j=l j=l 


i=1 


A robust version of this estimator can also be obtained through the MM approach. 

In some cases there are groups of covariables that have to be selected or rejected 
in a block. This happens, for example, when there is a categorical variable that is 
coded as k − 1 binary covariables, where k is the number of categories. In that case,
one would like either to select all the variables or to reject all the variables in the 
group; that is, to make either all of the coefficients or none of the coefficients in the 
group equal to zero. Yuan and Lin (2006) proposed an approach called group lasso 
to deal with this case. 
5.11 *Other estimators


5.11.1 Generalized M-estimators 


In this section we treat a family of estimators which is of historical importance. The
simplest way to robustify a monotone M-estimator is to downweight the influential
$\mathbf{x}_i$ to prevent them from dominating the estimating equations. Hence we may define
an estimator by

$$\sum_{i=1}^{n} \psi\left(\frac{r_i(\hat{\boldsymbol\beta})}{\hat\sigma}\right) \mathbf{x}_i W(d(\mathbf{x}_i)) = \mathbf{0}, \tag{5.83}$$

where $W$ is a weight function and $d(\mathbf{x})$ is some measure of the “largeness” of $\mathbf{x}$. Here
$\psi$ is monotone and $\hat\sigma$ is simultaneously estimated by an M-estimating equation of the
form (5.18). For instance, to fit a straight line $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, we may choose

$$d(x) = \frac{|x - \hat\mu_x|}{\hat\sigma_x}, \tag{5.84}$$

where $\hat\mu_x$ and $\hat\sigma_x$ are respectively robust location and dispersion statistics of the $x_i$,
such as the median and MAD. In order to bound the effect of influential points, $W$
must be such that $W(t)t$ is bounded.

More generally, we may let the weights depend on the residuals as well as the
predictor variables and use a generalized M-estimator (GM-estimator) $\hat{\boldsymbol\beta}$ defined by

$$\sum_{i=1}^{n} \eta\left(d(\mathbf{x}_i), \frac{r_i(\hat{\boldsymbol\beta})}{\hat\sigma}\right) \mathbf{x}_i = \mathbf{0}, \tag{5.85}$$

where for each $s$, $\eta(s, r)$ is a nondecreasing and bounded $\psi$-function of $r$, and $\hat\sigma$ is
obtained by a simultaneous M-scale estimator equation of the form

$$\frac{1}{n}\sum_{i=1}^{n} \rho_{\mathrm{scale}}\left(\frac{r_i(\hat{\boldsymbol\beta})}{\hat\sigma}\right) = \delta.$$

Two particular forms of GM-estimator have been of primary interest in the literature.
The first is the estimator (5.83), which corresponds to the choice $\eta(s, r) = W(s)\psi(r)$
and is called a “Mallows estimator” (Mallows, 1975). The second form is the choice

$$\eta(s, r) = \frac{\psi(sr)}{s}, \tag{5.86}$$

which was first proposed by Schweppe et al. (1970) in the context of electric power
systems.

The GM-estimator with the Schweppe function (5.86) is also called the
“Hampel–Krasker–Welsch” estimator (Krasker and Welsch, 1982). When $\psi$ is
Huber’s $\psi_k$, it is a solution to Hampel’s problem (Section 3.5.4). See Section 5.13.6
for details.
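
The two η-functions can be compared numerically. The following is a minimal sketch, assuming Huber's ψ with the commonly used constant k = 1.345, the unnormalized MAD in (5.84), and the illustrative weight function W(s) = min(1, c/s) for the Mallows form; none of these particular choices is prescribed by the text.

```python
import statistics


def huber_psi(r, k=1.345):
    """Huber's psi: the identity on [-k, k], clipped outside."""
    return max(-k, min(k, r))


def largeness(x, xs):
    """d(x) as in (5.84), with the median and the (unnormalized) MAD of xs."""
    med = statistics.median(xs)
    mad = statistics.median([abs(v - med) for v in xs])
    return abs(x - med) / mad


def eta_mallows(s, r, c=2.0, k=1.345):
    """eta(s, r) = W(s) * psi(r), with the illustrative W(s) = min(1, c/s)."""
    w = 1.0 if s <= c else c / s
    return w * huber_psi(r, k)


def eta_schweppe(s, r, k=1.345):
    """eta(s, r) = psi(s * r) / s as in (5.86); the limit for s -> 0 is r."""
    if s == 0:
        return r
    return huber_psi(s * r, k) / s


xs = [1, 2, 3, 4, 100]          # one high-leverage point
s_out = largeness(100, xs)      # 97 MADs away from the median
# In both cases eta(s, r) * s stays bounded, however large s is:
mallows_bound = eta_mallows(s_out, 5.0) * s_out
schweppe_bound = eta_schweppe(s_out, 5.0) * s_out
```

In this sketch the product η(s, r)s is at most ck for the Mallows form and at most k for the Schweppe form, which is the boundedness condition that keeps the influence of leverage points under control.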

Note that the function $d(x)$ in (5.84) depends on the data, and for this reason it
will be better to denote it by $d_n(x)$. The most usual way to measure largeness is as a
generalization of (5.84). Let $\hat{\boldsymbol\mu}_n$ and $\hat{\boldsymbol\Sigma}_n$ be a robust location vector and robust scatter
matrix; these will be looked at in more detail in Chapter 6. Then $d_n$ is defined as

$$d_n(\mathbf{x}) = (\mathbf{x} - \hat{\boldsymbol\mu}_n)' \hat{\boldsymbol\Sigma}_n^{-1} (\mathbf{x} - \hat{\boldsymbol\mu}_n). \tag{5.87}$$

In the case where $\hat{\boldsymbol\mu}_n$ and $\hat{\boldsymbol\Sigma}_n$ are the sample mean and covariance matrix, $\sqrt{d_n(\mathbf{x})}$ is
known as the Mahalanobis distance.

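Assume that $\hat{\boldsymbol\mu}_n$ and $\hat{\boldsymbol\Sigma}_n$ converge in probability to $\boldsymbol\mu$ and $\boldsymbol\Sigma$ respectively. With
this assumption, it can be shown that if the errors are symmetric, then the influence
function of a GM-estimator for the model (5.1)–(5.2) is

$$\mathrm{IF}((\mathbf{x}_0, y_0), F) = \sigma\, \eta\left(d(\mathbf{x}_0), \frac{y_0 - \mathbf{x}_0'\boldsymbol\beta}{\sigma}\right) \mathbf{B}^{-1}\mathbf{x}_0, \tag{5.88}$$

with

$$\mathbf{B} = -E\,\dot\eta\left(d(\mathbf{x}), \frac{u}{\sigma}\right)\mathbf{x}\mathbf{x}', \quad \dot\eta(s, r) = \frac{\partial \eta(s, r)}{\partial r}, \tag{5.89}$$

and

$$d(\mathbf{x}) = (\mathbf{x} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu).$$

Hence the IF is the same as would be obtained from (3.48) using $d$ instead of $d_n$.
It can be shown that $\hat{\boldsymbol\beta}$ is asymptotically normal, and as a consequence of (5.88), the
asymptotic covariance matrix of $\hat{\boldsymbol\beta}$ is

$$\sigma^2\, \mathbf{B}^{-1}\mathbf{C}\mathbf{B}^{-1} \tag{5.90}$$

with

$$\mathbf{C} = E\,\eta\left(d(\mathbf{x}), \frac{u}{\sigma}\right)^2 \mathbf{x}\mathbf{x}'.$$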
It follows from (5.88) that GM-estimators have several attractive properties:


• If $\eta(s, r)s$ is bounded, then their IF is bounded.

• The same condition ensures a positive BP (Maronna et al., 1979).

• They are defined by estimating equations, and hence easy to compute like ordinary
monotone M-estimators.


However, GM-estimators also have several drawbacks:


• Their efficiency depends on the distribution of $\mathbf{x}$: if $\mathbf{x}$ is heavy-tailed they cannot
be simultaneously very efficient and very robust.

• Their BP is less than 0.5 and is quite low for large p. For example, if $\mathbf{x}$ is multivariate
normal, the BP is $O(p^{-1/2})$ (Maronna et al., 1979).

• A further drawback is that the simultaneous estimation of $\sigma$ reduces the BP, especially
for large p (Maronna and Yohai, 1991).


For these reasons, GM-estimators, although much examined in the literature, are 
not a good choice, except perhaps for small p. See also Mili et al. (1996) and Pires 
et al. (1999). 
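5.11.2 Projection estimators


Note that the residuals of the LS estimator are uncorrelated with any linear combination
of the predictors. In fact, the normal equations (4.13) imply that for any
$\boldsymbol\lambda \in R^p$, the LS regression of the residuals $r_i$ on the projections $\boldsymbol\lambda'\mathbf{x}_i$ is zero, since
$\sum_{i=1}^n r_i \boldsymbol\lambda'\mathbf{x}_i = 0$. The LS estimator of regression through the origin is defined for
$\mathbf{z} = (z_1, \ldots, z_n)'$ and $\mathbf{y} = (y_1, \ldots, y_n)'$ as

$$b(\mathbf{z}, \mathbf{y}) = \frac{\sum_{i=1}^n z_i y_i}{\sum_{i=1}^n z_i^2},$$

and it follows that the LS estimator $\hat{\boldsymbol\beta}$ satisfies

$$b(\mathbf{X}\boldsymbol\lambda, \mathbf{r}(\hat{\boldsymbol\beta})) = 0 \quad \forall\, \boldsymbol\lambda \in R^p,\ \boldsymbol\lambda \neq \mathbf{0}. \tag{5.91}$$

A robust regression estimator could be obtained by (5.91) using for $b$ a robust
estimator of regression through the origin. But in general it is not possible to obtain
equality in (5.91). As a result, we must content ourselves with making $b$ “as small
as possible”. Let $\hat\sigma$ be a robust scale estimator, such as $\hat\sigma(\mathbf{z}) = \mathrm{Med}(|\mathbf{z}|)$. Then the
projection estimators for regression (“P-estimators”) proposed by Maronna and Yohai
(1993) are defined as

$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \left( \max_{\boldsymbol\lambda \neq \mathbf{0}} \left| b(\mathbf{X}\boldsymbol\lambda, \mathbf{r}(\boldsymbol\beta)) \right| \hat\sigma(\mathbf{X}\boldsymbol\lambda) \right), \tag{5.92}$$

which means that the residuals are “as uncorrelated as possible” with all projections.
Note that the condition $\boldsymbol\lambda \neq \mathbf{0}$ can be replaced by $\|\boldsymbol\lambda\| = 1$. The factor $\hat\sigma(\mathbf{X}\boldsymbol\lambda)$ is needed
to make the regression estimator scale equivariant.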

The “median of slopes” estimator for regression through the origin is defined as
the conditional median

$$b(\mathbf{x}, \mathbf{y}) = \mathrm{Med}\left( \frac{y_i}{x_i} \,\Big|\, x_i \neq 0 \right). \tag{5.93}$$

Martin et al. (1989) extended Huber’s minimax result for the median (Section 3.8.5)
showing that (5.93) minimizes asymptotic bias among regression invariant estimators.
Maronna and Yohai (1993) studied P-estimators with $b$ given by (5.93), which
they called MP estimators, and found that their maximum asymptotic bias is lower
than that of MM- and S-estimators. They have an $n^{-1/2}$ consistency rate, but are not
asymptotically normal, which makes their use difficult for inference.
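
A minimal sketch of the median of slopes (5.93), compared with the LS estimator of regression through the origin on a toy dataset with one gross outlier; the function names and the data are illustrative only.

```python
import statistics


def median_of_slopes(x, y):
    """b(x, y) of (5.93): median of y_i / x_i over observations with x_i != 0."""
    slopes = [yi / xi for xi, yi in zip(x, y) if xi != 0]
    return statistics.median(slopes)


def ls_through_origin(x, y):
    """LS slope for regression through the origin: sum(x*y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


x = [1.0, 2.0, 3.0, 4.0, 0.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 7.0, -50.0]   # last point is a gross outlier
b_med = median_of_slopes(x, y)          # slopes are 2, 2, 2, 2, -10
b_ls = ls_through_origin(x, y)          # pulled to a negative slope
```

The observation with $x_i = 0$ is simply left out of the median, as the conditioning in (5.93) requires.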

Maronna and Yohai (1993) show that if the x; are multivariate normal, then the 
maximum asymptotic bias of P-estimators does not depend on p, and is not larger 
than twice the minimax asymptotic bias for all regression equivariant estimators. 

Numerical computation of P-estimators is difficult because of the nested optimization
in (5.92). An approximate solution can be found by reducing the searches
over $\boldsymbol\beta$ and $\boldsymbol\lambda$ to finite sets. A set of $N$ candidate estimators $\hat{\boldsymbol\beta}_k$, $k = 1, \ldots, N$ is obtained
by subsampling, as in Section 5.7.2, and from these the candidate directions are computed as

$$\boldsymbol\lambda_{jk} = \frac{\hat{\boldsymbol\beta}_j - \hat{\boldsymbol\beta}_k}{\|\hat{\boldsymbol\beta}_j - \hat{\boldsymbol\beta}_k\|}, \quad j \neq k.$$

Then (5.92) is replaced by

$$\hat{\boldsymbol\beta} = \arg\min_{k} \left( \max_{j \neq k} \left| b(\mathbf{X}\boldsymbol\lambda_{jk}, \mathbf{r}(\hat{\boldsymbol\beta}_k)) \right| \hat\sigma(\mathbf{X}\boldsymbol\lambda_{jk}) \right).$$

The resulting approximate estimator is regression and affine equivariant. In principle
the procedure requires $N(N - 1)$ evaluations, but this can be reduced to $O(N \log N)$
by the trick described in Maronna and Yohai (1993).


5.11.3 Constrained M-estimators 


Mendes and Tyler (1996) define constrained M-estimators (CM-estimators for short),
as in (4.36)

$$(\hat{\boldsymbol\beta}, \hat\sigma) = \arg\min_{\boldsymbol\beta, \sigma} \left\{ \frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{r_i(\boldsymbol\beta)}{\sigma}\right) + \log\sigma \right\} \tag{5.94}$$

with the restriction

$$\frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{r_i(\boldsymbol\beta)}{\sigma}\right) \le \varepsilon, \tag{5.95}$$

where $\rho$ is a bounded $\rho$-function and $\varepsilon \in (0, 1)$. Note that if $\rho$ is bounded, (5.94)
cannot be handled without restrictions, for then $\sigma \to 0$ would yield a trivial solution.

Mendes and Tyler show that CM-estimators are M-estimators with the same
$\rho$. Thus they are asymptotically normal, with a normal distribution efficiency that
depends only on $\rho$ (but not on $\varepsilon$), and hence the efficiency can be made arbitrarily
high. Mendes and Tyler also show that for a continuous distribution the solution
asymptotically attains the bound (5.95), so that $\hat\sigma$ is an M-scale of the residuals. It
follows that the estimator has an asymptotic BP equal to $\min(\varepsilon, 1 - \varepsilon)$, and taking
$\varepsilon = 0.5$ yields the maximum BP.


5.11.4 Maximum depth estimators 


Maximum regression depth estimators were introduced by Rousseeuw and
Hubert (1999). Define the regression depth of $\boldsymbol\beta \in R^p$ with respect to a sample
$(\mathbf{x}_i, y_i)$ as

$$d(\boldsymbol\beta) = \frac{1}{n} \min_{\boldsymbol\lambda \neq \mathbf{0}} \#\left\{ \frac{r_i(\boldsymbol\beta)}{\boldsymbol\lambda'\mathbf{x}_i} < 0,\ \boldsymbol\lambda'\mathbf{x}_i \neq 0 \right\} \tag{5.96}$$

where $\boldsymbol\lambda \in R^p$. Then the maximum depth regression estimator is defined as

$$\hat{\boldsymbol\beta} = \arg\max_{\boldsymbol\beta} d(\boldsymbol\beta). \tag{5.97}$$

The solution need not be unique. Since only the direction matters, the infimum in
(5.96) may be restricted to $\{\|\boldsymbol\lambda\| = 1\}$. Like the P-estimators of Section 5.11.2, maximum
depth estimators are based on the univariate projections $\boldsymbol\lambda'\mathbf{x}_i$ of the predictors.
In the case of regression through the origin, $\hat\beta$ coincides with the median of slopes
given by (5.93). But when $p > 1$, the maximum asymptotic BP of these estimators at
the linear model (5.1) is 1/3, and for an arbitrary joint distribution of $(\mathbf{x}, y)$ it can only
be asserted to be $\ge 1/(p + 1)$.
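
For regression through the origin the depth (5.96) is easy to evaluate, because only the sign of λ matters and the minimum is taken over the two directions λ = ±1. The sketch below, with illustrative names and toy data, follows the strict inequality in (5.96) and shows that the median of slopes attains the maximum depth on these data.

```python
def regression_depth_origin(beta, x, y):
    """Depth (5.96) for the model y = beta * x, using the directions +1 and -1."""
    n = len(x)
    ratios = [(yi - beta * xi) / xi for xi, yi in zip(x, y) if xi != 0]
    count_plus = sum(1 for t in ratios if t < 0)    # direction lambda = +1
    count_minus = sum(1 for t in ratios if t > 0)   # direction lambda = -1
    return min(count_plus, count_minus) / n


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.5, 2.0, 6.0, 12.0, 50.0]          # slopes 0.5, 1, 2, 3, 10
depth_at_median = regression_depth_origin(2.0, x, y)   # beta = median slope
depth_far = regression_depth_origin(5.0, x, y)
grid = [-1.0, 0.0, 0.75, 1.5, 2.5, 5.0, 20.0]
depths_on_grid = [regression_depth_origin(b, x, y) for b in grid]
```

On this sample the value β = 2.5 has the same depth as the median slope, which illustrates that the maximizer need not be unique.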

Adrover et al. (2002) discuss the relationships between maximum depth and
P-estimators. They derive the asymptotic bias of the former and compare it to that of 
the MP-estimators defined in Section 5.11.2. Both biases turn out to be similar for 
moderate contamination (in particular, the GESs are equal), while the MP-estimator 
is better for large contamination. They define an approximate algorithm for comput- 
ing the maximum depth estimator, based on an analogous idea already studied for 
the MP-estimator. 


5.12 Other topics 


5.12.1 The exact fit property 


The so-called exact fit property states essentially that if a proportion $\alpha$ of observations
lies exactly on a subspace, and $1 - \alpha$ is less than the BP of a regression and scale
equivariant estimator, then the fit given by the estimator coincides with the subspace.
More precisely, let the FBP of $\hat{\boldsymbol\beta}$ be $\varepsilon^* = m^*/n$, and let the dataset contain $q$ points
such that $y_i = \mathbf{x}_i'\boldsymbol\gamma$ for some $\boldsymbol\gamma$. We prove in Section 5.13.3 that if $q > n - m^*$ then
$\hat{\boldsymbol\beta} = \boldsymbol\gamma$. For example, in the location case, if more than half the sample points are
concentrated at $x_0$, then the median coincides with $x_0$. In practice if a sufficiently
large number $q$ of observations satisfy an approximate linear fit $y_i \approx \mathbf{x}_i'\boldsymbol\gamma$ for some $\boldsymbol\gamma$,
then the estimator coincides approximately with that fit: $\hat{\boldsymbol\beta} \approx \boldsymbol\gamma$.
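
The location case mentioned above can be checked directly. In this minimal sketch (toy data, illustrative names) six of eleven points sit exactly at $x_0$, so the median reproduces $x_0$ whatever the remaining five values are, while the mean does not.

```python
import statistics

x0 = 3.0
# 6 of the n = 11 points are concentrated at x0; the other 5 are arbitrary.
sample = [x0] * 6 + [-100.0, 250.0, 1e6, -7.5, 42.0]
med = statistics.median(sample)
mean = statistics.mean(sample)

# Moving the 5 free points does not change the median.
moved = [x0] * 6 + [1e9, 1e9, 1e9, 1e9, 1e9]
med_moved = statistics.median(moved)
```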

The exact fit property implies that if a dataset comprises two linear substructures, 
an estimator with a high BP will choose to fit one of them, and this will allow the 
other to be discovered through the analysis of the residuals. A nonrobust estimator 
such as LS will instead try to make a compromise fit, with the undesirable result that 
the existence of two structures passes unnoticed. 


Example 5.5 To illustrate this point, we generate 100 points lying approximately
on a straight line with slope = 1, and another 50 points with slope = −1. The tables
and figures for this example are obtained with script ExactFit.R. Figure 5.18 shows
the fits corresponding to the LS and MM-estimators. It is seen that the latter fits the
bulk of the data, while the former fits neither of the two structures.


Figure 5.18 Artificial data: fits by the LS and MM-estimators


5.12.2 Heteroskedastic errors 


The asymptotic theory for M-estimators, which includes S- and MM-estimators, has 
been derived under the assumption that the errors are i.i.d. and hence homoskedastic. 

These assumptions do not always hold in practice. When the $y_i$ are time series or
spatial variables, the errors may be correlated. Moreover, in many cases the variability
of the error may depend on the explanatory variables; in particular, the conditional
variance of $y$ given $\mathbf{x}$ may depend on $\boldsymbol\beta'\mathbf{x} = E(y|\mathbf{x})$.

In fact, the assumptions of independent and homoskedastic errors are not necessary
for the consistency and asymptotic normality of M-estimators; it can be shown
that these properties hold under much weaker conditions. Nevertheless we can mention
two problems:


• The estimators may have lower efficiency than others that take into account the
correlation or heteroskedasticity of the errors.

• The asymptotic covariance matrix of M-estimators may be different from $v\mathbf{V}_x^{-1}$
given in (5.14), which was derived assuming i.i.d. errors. Therefore, the estimator
of it given in Section 5.6 would not converge to the true asymptotic covariance
matrix of $\hat{\boldsymbol\beta}$.


We deal with these problems in the next two subsections. 


5.12.2.1 Modelling heteroskedasticity for M-estimators 


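To improve the efficiency of M-estimators under heteroskedasticity, the dependence
of the error scale on $\mathbf{x}$ should be included in the model. For instance, we can replace
model (5.1) by

$$y_i = \boldsymbol\beta'\mathbf{x}_i + h(\boldsymbol\lambda, \boldsymbol\beta'\mathbf{x}_i)u_i,$$

where the $u_i$ are i.i.d. and independent of $\mathbf{x}_i$, and $\boldsymbol\lambda$ is an additional vector parameter.
In this case the error variances are proportional to $h^2(\boldsymbol\lambda, \boldsymbol\beta'\mathbf{x})$.
Observe that if we knew $h(\boldsymbol\lambda, \boldsymbol\beta'\mathbf{x})$, then the transformed variables

$$y_i^* = \frac{y_i}{h(\boldsymbol\lambda, \boldsymbol\beta'\mathbf{x}_i)}, \quad \mathbf{x}_i^* = \frac{\mathbf{x}_i}{h(\boldsymbol\lambda, \boldsymbol\beta'\mathbf{x}_i)}$$

would follow the homoskedastic regression model

$$y_i^* = \boldsymbol\beta'\mathbf{x}_i^* + u_i.$$

This suggests the following procedure to obtain robust estimators of $\boldsymbol\beta$ and $\boldsymbol\lambda$.


(i) Compute an initial robust estimator $\hat{\boldsymbol\beta}_0$ for homoskedastic regression; for
example, $\hat{\boldsymbol\beta}_0$ may be an MM-estimator.
(ii) Compute the residuals $r_i(\hat{\boldsymbol\beta}_0)$.
(iii) Use these residuals to obtain an estimator $\hat{\boldsymbol\lambda}$ of $\boldsymbol\lambda$. For example, if

$$h(\boldsymbol\lambda, t) = \exp(\lambda_1 + \lambda_2 |t|),$$

then $\boldsymbol\lambda$ can be estimated by a robust linear fit of $\log(|r_i(\hat{\boldsymbol\beta}_0)|)$ on $|\hat{\boldsymbol\beta}_0'\mathbf{x}_i|$.

(iv) Compute a robust estimator for homoskedastic regression based on the transformed
variables

$$y_i^* = \frac{y_i}{h(\hat{\boldsymbol\lambda}, \hat{\boldsymbol\beta}_0'\mathbf{x}_i)}, \quad \mathbf{x}_i^* = \frac{\mathbf{x}_i}{h(\hat{\boldsymbol\lambda}, \hat{\boldsymbol\beta}_0'\mathbf{x}_i)}.$$


Steps (i)-(iv) may be iterated.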
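
The flow of steps (i)-(iv) can be sketched for regression through the origin with a single predictor. This is only an illustration under loud assumptions: the MM-estimator is replaced by a much simpler L1 fit (a weighted median of slopes), the robust linear fit of step (iii) is replaced by a Theil–Sen type line (median of pairwise slopes), and all names and data are hypothetical. The sketch also assumes that the slope estimate is nonzero and that the predictor values are distinct.

```python
import math
import statistics


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]  # guard against rounding in the cumulative sum


def lad_origin(x, y):
    """L1 fit of y = beta * x: weighted median of y_i/x_i with weights |x_i|."""
    pts = [(yi / xi, abs(xi)) for xi, yi in zip(x, y) if xi != 0]
    return weighted_median([s for s, _ in pts], [w for _, w in pts])


def robust_line(t, z):
    """Line z = a + b*t: median of pairwise slopes, then median intercept."""
    slopes = [(z[j] - z[i]) / (t[j] - t[i])
              for i in range(len(t)) for j in range(i + 1, len(t))
              if t[j] != t[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median([zi - slope * ti for ti, zi in zip(t, z)])
    return intercept, slope


def heteroskedastic_fit(x, y, n_iter=3):
    beta = lad_origin(x, y)                                    # step (i)
    lam = (0.0, 0.0)
    for _ in range(n_iter):
        res = [yi - beta * xi for xi, yi in zip(x, y)]         # step (ii)
        pts = [(abs(beta * xi), math.log(abs(ri)))             # skip zero residuals
               for xi, ri in zip(x, res) if ri != 0]
        lam = robust_line([p[0] for p in pts], [p[1] for p in pts])  # step (iii)
        h = [math.exp(lam[0] + lam[1] * abs(beta * xi)) for xi in x]
        beta = lad_origin([xi / hi for xi, hi in zip(x, h)],   # step (iv)
                          [yi / hi for yi, hi in zip(y, h)])
    return beta, lam


# Toy data: slope 2, with an error whose size grows with x.
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + 0.05 * xi * math.sin(3.0 * xi) for xi in x]
beta_hat, lam_hat = heteroskedastic_fit(x, y)
```

The transformation in step (iv) changes the weights given to the individual slopes $y_i/x_i$, which is how the estimated scale function enters the final fit in this simplified setting.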

Robust methods for heteroskedastic regression have been proposed by Carroll
and Ruppert (1982) who used monotone M-estimators; by Giltinan et al. (1986) who
employed GM-estimators; and by Bianco et al. (2000) and Bianco and Boente (2002)
who defined estimators with high BP and bounded influence starting with an initial
MM-estimator followed by one Newton–Raphson step of a GM-estimator.


5.12.2.2 Estimating the asymptotic covariance matrix under 
heteroskedastic errors 


Simpson et al. (1992) proposed an estimator for the asymptotic covariance matrix
of regression GM-estimators that does not require homoskedasticity but requires
symmetry of the error distribution. Salibian-Barrera (2000) and Croux et al. (2003)
proposed a method to estimate the asymptotic covariance matrix of a regression
M-estimator requiring neither homoskedasticity nor symmetry. This method can
also be applied to simultaneous M-estimators of regression and scale (which include
S-estimators) and to MM-estimators.

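We shall give some details of the method for the case of MM-estimators. Let $\hat{\boldsymbol\gamma}$ and
$\hat\sigma$ be the initial S-estimator used to compute the MM-estimator and the corresponding
scale estimator, respectively. Since $\hat{\boldsymbol\gamma}$ and $\hat\sigma$ are M-estimators of regression and of
scale, they satisfy equations of the form

$$\sum_{i=1}^{n} \psi_0\left(\frac{r_i(\hat{\boldsymbol\gamma})}{\hat\sigma}\right)\mathbf{x}_i = \mathbf{0}, \tag{5.98}$$

$$\frac{1}{n}\sum_{i=1}^{n} \left[\rho_0\left(\frac{r_i(\hat{\boldsymbol\gamma})}{\hat\sigma}\right) - \delta\right] = 0, \tag{5.99}$$

with $\psi_0 = \rho_0'$. The final MM-estimator $\hat{\boldsymbol\beta}$ is a solution of

$$\sum_{i=1}^{n} \psi\left(\frac{r_i(\hat{\boldsymbol\beta})}{\hat\sigma}\right)\mathbf{x}_i = \mathbf{0}. \tag{5.100}$$

To explain the proposed method we need to express the system (5.98)–(5.100) as a
single set of M-estimating equations. To this end, let the vector $\boldsymbol\gamma$ represent the values
taken on by $\hat{\boldsymbol\gamma}$. Let $\mathbf{z} = (\mathbf{x}, y)$. For $\boldsymbol\gamma, \boldsymbol\beta \in R^p$ and $\sigma \in R$, put $\boldsymbol\alpha = (\boldsymbol\gamma', \sigma, \boldsymbol\beta')' \in R^{2p+1}$,
and define the function

$$\boldsymbol\Psi(\mathbf{z}, \boldsymbol\alpha) = (\boldsymbol\Psi_1(\mathbf{z}, \boldsymbol\alpha), \Psi_2(\mathbf{z}, \boldsymbol\alpha), \boldsymbol\Psi_3(\mathbf{z}, \boldsymbol\alpha))$$

where

$$\boldsymbol\Psi_1(\mathbf{z}, \boldsymbol\alpha) = \psi_0\left(\frac{y - \boldsymbol\gamma'\mathbf{x}}{\sigma}\right)\mathbf{x},$$

$$\Psi_2(\mathbf{z}, \boldsymbol\alpha) = \rho_0\left(\frac{y - \boldsymbol\gamma'\mathbf{x}}{\sigma}\right) - \delta,$$

$$\boldsymbol\Psi_3(\mathbf{z}, \boldsymbol\alpha) = \psi\left(\frac{y - \boldsymbol\beta'\mathbf{x}}{\sigma}\right)\mathbf{x}.$$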

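Then $\hat{\boldsymbol\alpha} = (\hat{\boldsymbol\gamma}, \hat\sigma, \hat{\boldsymbol\beta})$ is an M-estimator satisfying

$$\sum_{i=1}^{n} \boldsymbol\Psi(\mathbf{z}_i, \hat{\boldsymbol\alpha}) = \mathbf{0},$$

and therefore, in line with the results of Section 3 leading to (3.48), its asymptotic covariance matrix has the sandwich form

$$\mathbf{V} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}'^{-1}$$

with

$$\mathbf{A} = E[\boldsymbol\Psi(\mathbf{z}, \boldsymbol\alpha)\boldsymbol\Psi(\mathbf{z}, \boldsymbol\alpha)']$$

and

$$\mathbf{B} = E\left(\frac{\partial \boldsymbol\Psi(\mathbf{z}, \boldsymbol\alpha)}{\partial \boldsymbol\alpha}\right),$$

where the expectation is calculated under the model $y = \mathbf{x}'\boldsymbol\beta + u$ and taking $\boldsymbol\gamma = \boldsymbol\beta$.
Then $\mathbf{V}$ can be estimated by

$$\hat{\mathbf{V}} = \hat{\mathbf{B}}^{-1}\hat{\mathbf{A}}\hat{\mathbf{B}}'^{-1},$$

where $\hat{\mathbf{A}}$ and $\hat{\mathbf{B}}$ are the empirical versions of $\mathbf{A}$ and $\mathbf{B}$ obtained by replacing $\boldsymbol\alpha$
with $\hat{\boldsymbol\alpha}$.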

Observe that the only requirement for $\hat{\mathbf{V}}$ to be a consistent estimator of $\mathbf{V}$ is that
the observations $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ be i.i.d., but this does not require any condition
on the conditional distribution of $y$ given $\mathbf{x}$, for example homoskedasticity. For the
justification of the above procedure see Remark 3 and point (viii) of Theorem 6 of
Fasano et al. (2012).

Croux et al. (2003) also consider estimators of V when the errors are not inde- 
pendent. 


5.12.3 A robust multiple correlation coefficient 


In a multiple linear regression model, the $R^2$ statistic measures the proportion of the
variation in the dependent variable accounted for by the explanatory variables. It is
defined for a model with intercept (4.5) as

$$R^2 = \frac{\mathrm{Var}(y) - Eu^2}{\mathrm{Var}(y)} \tag{5.101}$$

and it can be estimated by

$$\hat R^2 = \frac{S_0^2 - S^2}{S_0^2} \tag{5.102}$$

with

$$S^2 = \sum_{i=1}^{n} r_i^2, \quad S_0^2 = \sum_{i=1}^{n} (y_i - \bar y)^2, \tag{5.103}$$

where $r_i$ are the LS residuals. Note that $\bar y$ is the LS estimator of the regression coefficients
under model (4.5) with the restriction $\boldsymbol\beta_1 = \mathbf{0}$.

Recall that $S^2/(n - p^*)$ and $S_0^2/(n - 1)$ are unbiased estimators of the error variance
for the complete model, and for the model with $\boldsymbol\beta_1 = \mathbf{0}$, respectively. To take the
degrees of freedom into account, an adjusted coefficient $R_a^2$ is defined by

$$R_a^2 = \frac{S_0^2/(n - 1) - S^2/(n - p^*)}{S_0^2/(n - 1)}, \tag{5.104}$$

which is equivalent to

$$R_a^2 = 1 - \frac{n - 1}{n - p^*}\,(1 - \hat R^2). \tag{5.105}$$
If, instead of the LS estimator, we use an M-estimator based on a bounded
$\rho$-function $\rho$ and a scale $\sigma$ defined as in (5.7), a robust version of $R^2$ can be defined as

$$R_\rho^2 = \frac{A_0 - A_1}{A_0 (1 - A_1)} \tag{5.106}$$

with

$$A_0 = \min_{\beta_0 \in R} E\rho\left(\frac{y - \beta_0}{\sigma}\right), \quad A_1 = \min_{\boldsymbol\beta \in R^p} E\rho\left(\frac{y - \mathbf{x}'\boldsymbol\beta}{\sigma}\right),$$

where the factor $(1 - A_1)$ in the denominator ensures that $0 \le R_\rho^2 \le 1$. Note that we
are assuming that $\rho \in [0, 1]$; if $\rho$ is the classical quadratic function, then (5.106)
does not coincide with (5.101).

We can estimate $R_\rho^2$ by

$$\hat R_\rho^2 = \frac{S_{\rho 0}^2 - S_\rho^2}{(1 - S_\rho^2/n)\, S_{\rho 0}^2},$$

where

$$S_\rho^2 = \min_{\boldsymbol\beta \in R^p} \sum_{i=1}^{n} \rho\left(\frac{r_i(\boldsymbol\beta)}{\hat\sigma}\right), \quad S_{\rho 0}^2 = \min_{\beta_0 \in R} \sum_{i=1}^{n} \rho\left(\frac{y_i - \beta_0}{\hat\sigma}\right),$$

and where $\hat\sigma$ is a robust estimate of the error scale.

However $\hat R_\rho^2$ is asymptotically biased as an estimator of $R^2$. In fact it is easy to see
that in the case that $y$ and $u$ are normal, we have $R_\rho^2 = g(R^2)$, where $g$ is a continuous
and strictly monotone function different from the identity. The function $g$ is plotted
in Figure 5.19. Then we can obtain an estimator of $R^2$ that is unbiased in the case of
normal data, as $g^{-1}(\hat R_\rho^2)$. An adjusted estimator of $R^2$ for finite samples can
be obtained by replacing $\hat R^2$ in the right-hand side of (5.105) by $g^{-1}(\hat R_\rho^2)$.

To compute the function $g$, we can assume, without loss of generality, that $\sigma = 1$.
Since in this case $u \sim N(0, 1)$ and $y \sim N(E(y), 1/(1 - R^2))$, we have $A_1 = E(\rho(u))$
and $A_0 = E(\rho(u/\sqrt{1 - R^2}))$.
Croux and Dehon (2003) have considered alternative definitions of a robust $R^2$.
They proposed an estimator of $R^2$ of the form

$$R^2 = 1 - \left(\frac{s(r_1(\hat{\boldsymbol\beta}), \ldots, r_n(\hat{\boldsymbol\beta}))}{s(y_1 - \hat\mu, \ldots, y_n - \hat\mu)}\right)^2 \tag{5.107}$$

where $s$ is a robust scale estimator, $\hat{\boldsymbol\beta}$ a robust regression estimator of $\boldsymbol\beta$ and $\hat\mu$ a
robust location estimator applied to the sample $y_1, \ldots, y_n$. If $s$ and $\hat\mu$ are consistent
for the standard deviation and mean at the normal distribution, then $R^2$ is consistent
for the classical $R^2$ for normal models. If $\hat{\boldsymbol\beta}$ and $\hat\mu$ are S-estimators of regression and
location that minimize the scale $s$, then this $R^2$ is nonnegative. This need not happen
in all cases if other estimators are employed.
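
A minimal sketch of a coefficient of the form (5.107), assuming the simplest possible ingredients: the scale $s$ is the normalized median of absolute values (divided by 0.6745 for consistency at the normal), the location estimator is the median, and the residuals come from whatever robust fit is available. The function names and data are illustrative only.

```python
import statistics


def median_abs_scale(z):
    """s(z) = Med(|z_i|) / 0.6745, consistent for the SD at the centered normal."""
    return statistics.median([abs(v) for v in z]) / 0.6745


def robust_r2(residuals, y):
    """Coefficient of the form (5.107), with the median as location estimator."""
    mu = statistics.median(y)
    s_res = median_abs_scale(residuals)
    s_tot = median_abs_scale([yi - mu for yi in y])
    return 1.0 - (s_res / s_tot) ** 2


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
residuals = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1]   # from some robust fit
r2 = robust_r2(residuals, y)    # Med|r| = 0.1, Med|y - 4| = 2
```

With these simple choices the coefficient is not guaranteed to be nonnegative, in line with the remark above about estimators that do not minimize the scale $s$.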
Figure 5.19 Relationship between $R_\rho^2$ and $R^2$


The estimator $\hat R_\rho^2$ is especially adapted to express the proportion of explained
variability of regression M-estimators. Since the regression estimators proposed in this
book all have the form of M-estimators, we recommend this correlation coefficient
rather than (5.107).


5.13. *Appendix: proofs and complements 


5.13.1 The BP of monotone M-estimators with random X 


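We assume $\sigma$ known and equal to one. The estimator verifies

$$\psi(y_1 - \mathbf{x}_1'\hat{\boldsymbol\beta})\mathbf{x}_1 + \sum_{i=2}^{n} \psi(y_i - \mathbf{x}_i'\hat{\boldsymbol\beta})\mathbf{x}_i = \mathbf{0}. \tag{5.108}$$

Let $y_1$ and $\mathbf{x}_1$ tend to infinity in such a way that $y_1/\|\mathbf{x}_1\| \to \infty$. If $\hat{\boldsymbol\beta}$ remained
bounded, we would have

$$y_1 - \mathbf{x}_1'\hat{\boldsymbol\beta} \ge y_1 - \|\mathbf{x}_1\|\,\|\hat{\boldsymbol\beta}\| = \|\mathbf{x}_1\|\left(\frac{y_1}{\|\mathbf{x}_1\|} - \|\hat{\boldsymbol\beta}\|\right) \to \infty.$$

Since $\psi$ is nondecreasing, $\psi(y_1 - \mathbf{x}_1'\hat{\boldsymbol\beta})$ would tend to $\sup\psi > 0$, and hence the first
term in (5.108) would tend to infinity, while the sum would remain bounded.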
5.13.2 Heavy-tailed x


The behavior of the estimators under heavy-tailed $x$ is most easily understood when
the estimator is the LS estimator and $p = 1$; that is,

$$y_i = \beta x_i + u_i,$$

where $\{x_i\}$ and $\{u_i\}$ are independent i.i.d. sequences. Then

$$\hat\beta_n = \frac{\sum_{i=1}^{n} x_i y_i}{T_n} \quad \text{with} \quad T_n = \sum_{i=1}^{n} x_i^2.$$

Assume $Eu_i = 0$ and $\mathrm{Var}(u_i) = 1$. Then

$$\mathrm{Var}(\hat\beta_n | \mathbf{X}) = \frac{1}{T_n} \quad \text{and} \quad E(\hat\beta_n | \mathbf{X}) = \beta,$$

and hence, by a well-known property of the variance; see, for example, Feller (1971):

$$\mathrm{Var}(\sqrt{n}\hat\beta_n) = n\{E[\mathrm{Var}(\hat\beta_n | \mathbf{X})] + \mathrm{Var}[E(\hat\beta_n | \mathbf{X})]\} = E\,\frac{1}{T_n/n}.$$

If $a = Ex_i^2 < \infty$, the law of large numbers implies that $T_n/n \to_p a$, and, under
suitable conditions on the $x_i$, this implies that

$$E\,\frac{1}{T_n/n} \to \frac{1}{a}, \tag{5.109}$$

hence

$$\mathrm{Var}(\sqrt{n}\hat\beta_n) \to \frac{1}{a}.$$

In other words, $\hat\beta$ is $\sqrt{n}$-consistent.
If, instead, $Ex_i^2 = \infty$, then $T_n/n \to_p \infty$, which implies that

$$\mathrm{Var}(\sqrt{n}\hat\beta_n) \to 0,$$

and hence $\hat\beta$ tends to $\beta$ at a higher rate than $\sqrt{n}$.

A simple sufficient condition for (5.109) is that $|x_i| \ge c$ for some $c > 0$,
for then $n/T_n \le 1/c^2$ and (5.109) holds by the bounded convergence theorem
(Theorem 10.6). But the result can be shown to hold under more general assumptions.


5.13.3 Proof of the exact fit property 


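Define for $t \in R$:

$$\mathbf{y}^* = \mathbf{y} + t(\mathbf{y} - \mathbf{X}\boldsymbol\gamma).$$

Then the regression and scale equivariance of $\hat{\boldsymbol\beta}$ implies

$$\hat{\boldsymbol\beta}(\mathbf{X}, \mathbf{y}^*) = \hat{\boldsymbol\beta}(\mathbf{X}, \mathbf{y}) + t(\hat{\boldsymbol\beta}(\mathbf{X}, \mathbf{y}) - \boldsymbol\gamma).$$

Since for all $t$, $\mathbf{y}^*$ has at least $q > n - m^*$ values in common with $\mathbf{y}$, the above
expression must remain bounded, and this requires $\hat{\boldsymbol\beta}(\mathbf{X}, \mathbf{y}) - \boldsymbol\gamma = \mathbf{0}$.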
5.13.4 The BP of S-estimators


It will be shown that the finite BP of an S-estimator defined in Section 5.4.1 does not
depend on $\mathbf{y}$, and that its maximum is given by (5.19)-(5.20).

This result has been proved by Rousseeuw and Leroy (1987) and Mili and Coakley
(1996) under slightly more restricted conditions. The main result of this section
is the following.


Theorem 5.1 Let $m^*$ be as in (5.9) and $m^*_{\max}$ as in (5.20). Call $m(\delta)$ the largest
integer $< n\delta$. Then:


(a) $m^* \le n\delta$
(b) if $[n\delta] \le m^*_{\max}$, then $m^* \ge m(\delta)$.


It follows from this result that if $n\delta$ is not an integer and $\delta \le m^*_{\max}/n$, then
$m^* = [n\delta]$, and hence the $\delta$ given by (5.21) yields $m^* = m^*_{\max}$.
To prove the theorem we first need an auxiliary result.


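Lemma 5.1 Consider any sequence $\mathbf{r}_N = (r_{N,1}, \ldots, r_{N,n})$ with $\sigma_N = \hat\sigma(\mathbf{r}_N)$. Then
(i) Let $C = \{i : |r_{N,i}| \to \infty\}$. If $\#(C) > n\delta$, then $\sigma_N \to \infty$.

(ii) Let $D = \{i : |r_{N,i}| \text{ is bounded}\}$. If $\#(D) > n - n\delta$, then $\sigma_N$ is bounded.

Proof of the Lemma


(i) Assume $\sigma_N$ bounded. Then the definition of $\sigma_N$ implies

$$n\delta \ge \lim_{N\to\infty} \sum_{i \in C} \rho\left(\frac{r_{N,i}}{\sigma_N}\right) = \#(C) > n\delta,$$

which is a contradiction.
(ii) To show that $\sigma_N$ remains bounded, assume that $\sigma_N \to \infty$. Then $r_{N,i}/\sigma_N \to 0$ for
$i \in D$, which implies

$$n\delta = \lim_{N\to\infty} \sum_{i=1}^{n} \rho\left(\frac{r_{N,i}}{\sigma_N}\right) = \lim_{N\to\infty} \sum_{i \notin D} \rho\left(\frac{r_{N,i}}{\sigma_N}\right) \le n - \#(D) < n\delta,$$

which is a contradiction.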


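Proof of (a): It will be shown that $m^* \le n\delta$. Let $m > n\delta$. Take $C \subset \{1, \ldots, n\}$ with
$\#(C) = m$. Let $\mathbf{x}_0 \in R^p$ with $\|\mathbf{x}_0\| = 1$. Given a sequence $(\mathbf{X}_N, \mathbf{y}_N)$, define for $\boldsymbol\beta \in R^p$

$$\mathbf{r}_N(\boldsymbol\beta) = \mathbf{y}_N - \mathbf{X}_N\boldsymbol\beta.$$

Take $(\mathbf{X}_N, \mathbf{y}_N)$ such that

$$(\mathbf{x}_{N,i}, y_{N,i}) = \begin{cases} (N\mathbf{x}_0, N^2) & \text{if } i \in C \\ (\mathbf{x}_i, y_i) & \text{otherwise.} \end{cases} \tag{5.110}$$

It will be shown that the estimator $\hat{\boldsymbol\beta}_N$ based on $(\mathbf{X}_N, \mathbf{y}_N)$ cannot be bounded.
Assume first that $\hat{\boldsymbol\beta}_N$ is bounded, which implies that $|r_{N,i}| \to \infty$ for $i \in C$. Then
part (i) of the lemma implies that $\hat\sigma(\mathbf{r}_N(\hat{\boldsymbol\beta}_N)) \to \infty$. Since $n\delta/m < 1 = \rho(\infty)$, condition
R3 of Definition 2.1 implies that there is a single value $\gamma$ such that

$$\rho\left(\frac{1}{\gamma}\right) = \frac{n\delta}{m}. \tag{5.111}$$

It will be shown that

$$\frac{1}{N^2}\,\hat\sigma(\mathbf{r}_N(\hat{\boldsymbol\beta}_N)) \to \gamma. \tag{5.112}$$

In fact, writing $\hat\sigma_N = \hat\sigma(\mathbf{r}_N(\hat{\boldsymbol\beta}_N))$,

$$n\delta = \sum_{i \notin C} \rho\left(\frac{y_i - \mathbf{x}_i'\hat{\boldsymbol\beta}_N}{\hat\sigma_N}\right) + \sum_{i \in C} \rho\left(\frac{N^2 - N\mathbf{x}_0'\hat{\boldsymbol\beta}_N}{\hat\sigma_N}\right).$$

The first sum tends to zero. The second one is

$$m\rho\left(\frac{1 - N^{-1}\mathbf{x}_0'\hat{\boldsymbol\beta}_N}{N^{-2}\hat\sigma_N}\right).$$

The numerator of the fraction tends to one. If a subsequence $\{N^{-2}\hat\sigma_N\}$ has a (possibly
infinite) limit $t$, then it must fulfill $n\delta = m\rho(1/t)$, which proves (5.112).
Now define $\tilde{\boldsymbol\beta}_N = \mathbf{x}_0 N/2$, so that $\mathbf{r}_N(\tilde{\boldsymbol\beta}_N)$ has elements

$$\tilde r_{N,i} = \frac{N^2}{2} \ \text{for } i \in C, \quad \tilde r_{N,i} = y_i - \mathbf{x}_i'\mathbf{x}_0\frac{N}{2} \ \text{otherwise.}$$

Since $\#\{i : |\tilde r_{N,i}| \to \infty\} = n$, part (i) of the lemma implies that $\hat\sigma(\mathbf{r}_N(\tilde{\boldsymbol\beta}_N)) \to \infty$, and
proceeding as in the proof of (5.112) yields

$$\frac{1}{N^2}\,\hat\sigma(\mathbf{r}_N(\tilde{\boldsymbol\beta}_N)) \to \frac{\gamma}{2},$$

and hence

$$\hat\sigma(\mathbf{r}_N(\tilde{\boldsymbol\beta}_N)) < \hat\sigma(\mathbf{r}_N(\hat{\boldsymbol\beta}_N))$$

for large $N$, so that $\hat{\boldsymbol\beta}_N$ cannot minimize $\hat\sigma$.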


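Proof of (b): Let $m \le m(\delta) < n\delta$, and consider a contamination sequence in a set $C$
of size $m$. It will be shown that the corresponding estimator $\hat{\boldsymbol\beta}_N$ is bounded. Assume
first that $\hat{\boldsymbol\beta}_N \to \infty$. Then, for $i \notin C$,

$$|r_{N,i}(\hat{\boldsymbol\beta}_N)| \to \infty \quad \text{whenever} \quad \hat{\boldsymbol\beta}_N'\mathbf{x}_{N,i} \neq 0,$$

and hence

$$\#\{i : |r_{N,i}(\hat{\boldsymbol\beta}_N)| \to \infty\} \ge \#\{i : \hat{\boldsymbol\beta}_N'\mathbf{x}_{N,i} \neq 0,\ i \notin C\} = n - \#(\{i : \hat{\boldsymbol\beta}_N'\mathbf{x}_{N,i} = 0\} \cup C).$$

The Bonferroni inequality implies that

$$\#(\{i : \hat{\boldsymbol\beta}_N'\mathbf{x}_{N,i} = 0\} \cup C) \le \#\{i : \hat{\boldsymbol\beta}_N'\mathbf{x}_{N,i} = 0\} + \#(C),$$

and $\#\{i : \hat{\boldsymbol\beta}_N'\mathbf{x}_{N,i} = 0\} \le k^*(\mathbf{X})$ by (4.56). Hence

$$\#\{i : |r_{N,i}(\hat{\boldsymbol\beta}_N)| \to \infty\} \ge n - k^*(\mathbf{X}) - m.$$

Now (4.58) implies that $n - k^*(\mathbf{X}) \ge 2m^*_{\max} + 1$, and since

$$m \le m(\delta) \le [n\delta] \le m^*_{\max},$$

we have

$$\#\{i : |r_{N,i}(\hat{\boldsymbol\beta}_N)| \to \infty\} \ge 1 + m^*_{\max} \ge 1 + [n\delta] > n\delta,$$

which by part (i) of the Lemma implies $\hat\sigma(\mathbf{r}_N(\hat{\boldsymbol\beta}_N)) \to \infty$.
Assume now that $\hat{\boldsymbol\beta}_N$ is bounded. Then

$$\#\{i : |r_{N,i}(\hat{\boldsymbol\beta}_N)| \to \infty\} \le m < n\delta,$$

which by part (ii) of the lemma implies $\hat\sigma(\mathbf{r}_N(\hat{\boldsymbol\beta}_N))$ is bounded. Hence $\hat{\boldsymbol\beta}_N$ cannot be
unbounded. This completes the proof of the finite BP.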


The least quantile estimator corresponds to the scale given by $\rho(t) = \mathrm{I}(|t| > 1)$, and
according to Problem 2.14 it has $\hat\sigma = |r|_{(h)}$, where $|r|_{(i)}$ are the ordered absolute residuals
and $h = n - [n\delta]$. The optimal choice of $\delta$ in (5.21) yields $h = n - m^*_{\max}$, and
formal application of the theorem would imply that this $h$ yields the maximum FBP.
In fact, the proof of the theorem does not hold because $\rho$ is discontinuous and hence
does not fulfill (5.111), but the proof can be reworked for $\hat\sigma = |r|_{(h)}$ to show that the
result also holds in this case.
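
The least quantile scale is simple to compute. The following minimal sketch takes the residuals as given and returns $|r|_{(h)}$ with $h = n - [n\delta]$, where $[\cdot]$ is the integer part; it assumes $0 < \delta < 1$, and the function name and data are illustrative only.

```python
import math


def least_quantile_scale(residuals, delta):
    """sigma-hat = |r|_(h) with h = n - [n * delta], for 0 < delta < 1."""
    n = len(residuals)
    h = n - math.floor(n * delta)
    ordered = sorted(abs(r) for r in residuals)
    return ordered[h - 1]       # h-th smallest absolute residual (1-based)


residuals = [0.5, -1.0, 2.0, -3.0, 4.0, -5.0, 6.0, -7.0, 80.0, -90.0]
scale = least_quantile_scale(residuals, delta=0.5)   # h = 5
```

With δ = 0.5 the two gross residuals (80 and −90) have no effect on the result, since only the fifth smallest absolute residual is used.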

As for the asymptotic BP, a proof similar to but much simpler than that of
Theorem 5.1, with averages replaced by expectations, shows that in the asymptotic
case $\varepsilon^* \le \delta$ and if $\delta \le (1 - \alpha)/2$ then $\varepsilon^* \ge \delta$. It follows that $\varepsilon^* = \delta$ for
$\delta \le (1 - \alpha)/2$, and this proves that the maximum asymptotic BP is (5.22).


5.13.5 Asymptotic bias of M-estimators 


Let $F = \mathcal{D}(\mathbf{x}, y)$ be $N_{p+1}(\mathbf{0}, \mathbf{I})$. We shall first show that the asymptotic bias under
point-mass contamination of M-estimators and of estimators that minimize a robust
scale does not depend on the dimension $p$. To simplify the exposition, we consider
only the case of an M-estimator with known scale $\sigma = 1$. Call $(\mathbf{x}_0, y_0)$ the contamination
location. The asymptotic value of the estimator is given by

$$\hat{\boldsymbol\beta}_\infty = \arg\min_{\boldsymbol\beta} L(\boldsymbol\beta),$$

where

$$L(\boldsymbol\beta) = (1 - \varepsilon)E_F\rho(y - \mathbf{x}'\boldsymbol\beta) + \varepsilon\rho(y_0 - \mathbf{x}_0'\boldsymbol\beta). \tag{5.113}$$

Since $\mathcal{D}(y - \mathbf{x}'\boldsymbol\beta) = N(0, 1 + \|\boldsymbol\beta\|^2)$ under $F$, we have

$$E_F\rho(y - \mathbf{x}'\boldsymbol\beta) = g(\|\boldsymbol\beta\|),$$

where

$$g(t) = E\rho\left(z\sqrt{1 + t^2}\right), \quad z \sim N(0, 1).$$

It is easy to show that $g$ is an increasing function.
By the affine equivariance of the estimator, $\|\hat{\boldsymbol\beta}_\infty\|$ does not change if we take $\mathbf{x}_0$
along the first coordinate axis; in other words, of the form $\mathbf{x}_0 = (x_0, 0, \ldots, 0)$. Therefore,

$$L(\boldsymbol\beta) = (1 - \varepsilon)g(\|\boldsymbol\beta\|) + \varepsilon\rho(y_0 - x_0\beta_1),$$

where $\beta_1$ is the first coordinate of $\boldsymbol\beta$.
Given $\boldsymbol\beta = (\beta_1, \beta_2, \ldots, \beta_p)$, with $\beta_j \neq 0$ for some $j \ge 2$, the vector $\tilde{\boldsymbol\beta} = (\beta_1, 0, \ldots, 0)$
has $\|\tilde{\boldsymbol\beta}\| < \|\boldsymbol\beta\|$, which implies $g(\|\tilde{\boldsymbol\beta}\|) < g(\|\boldsymbol\beta\|)$ and $L(\tilde{\boldsymbol\beta}) < L(\boldsymbol\beta)$.
Then, we may restrict the search to the vectors of the form $(\beta_1, 0, \ldots, 0)$, for which

$$L(\boldsymbol\beta) = L_1(\beta_1) = (1 - \varepsilon)g(\beta_1) + \varepsilon\rho(y_0 - x_0\beta_1),$$

and therefore the value minimizing $L_1(\beta_1)$ depends only on $x_0$ and $y_0$, and not on $p$,
which proves the initial assertion.

It follows that the maximum asymptotic bias for point-mass contamination does
not depend on $p$. Actually, it can be shown that the maximum asymptotic bias for
unrestricted contamination coincides with the former, and hence does not depend on
$p$ either.

The same results hold for M-estimators with the previous scale, and for
S-estimators, but the details are more involved.


5.13.6 Hampel optimality for GM-estimators 


We now deal with general M-estimators for regression through the origin ($y = \beta x + u$),
defined by

$$\sum_{i=1}^{n} \Psi(x_i, y_i; \hat\beta) = 0,$$

with $\Psi$ of the form

$$\Psi(x, y; \beta) = \eta(x, y - x\beta)x.$$

Assume $\sigma$ is known and equal to one. It follows that the influence function is

$$\mathrm{IF}((x_0, y_0), F) = \frac{1}{b}\,\eta(x_0, y_0 - x_0\beta)x_0,$$

where

$$b = -E\dot\eta(x, y - \beta x)x^2,$$

with $\dot\eta$ defined in (5.89), and hence the GES is

$$\gamma^* = \sup_{x_0, y_0} |\mathrm{IF}((x_0, y_0), F)| = \frac{1}{b}\sup_{s > 0} sK(s), \quad \text{with} \quad K(s) = \sup_r |\eta(s, r)|.$$

The asymptotic variance is

$$v = \frac{1}{b^2}E\eta(x, y - x\beta)^2 x^2.$$

The direct and dual Hampel problems can now be stated as minimizing $v$ subject
to a bound on $\gamma^*$, and minimizing $\gamma^*$ subject to a bound on $v$, respectively.

Let $F$ correspond to the model (5.1)-(5.2), with normal $u_i$. The MLE corresponds to

$$\Psi_0(x, y; \beta) = (y - x\beta)x.$$

Since the estimators are equivariant, it suffices to treat the case of $\beta = 0$. Proceeding
as in Section 3.8.7, it follows that the solutions to both problems,

$$\tilde\Psi(x, y; \beta) = \tilde\eta(x, y - x\beta)x,$$

have the form $\tilde\Psi(x, y; \beta) = \psi_k(\Psi_0(x, y; \beta))$ for some $k > 0$, where $\psi_k$ is Huber’s $\psi$,
which implies that $\tilde\eta$ has the form (5.86).

The case $p > 1$ is more difficult to deal with, since $\boldsymbol\beta$, and hence the IF, are
multidimensional. But the present reasoning gives some justification for the use of
(5.86).


5.13.7 Justification of RFPE* 


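We are going to give a heuristic justification of (5.39). Let $(\mathbf{x}_i, y_i)$, $i = 0, 1, \ldots, n$ be
i.i.d. and satisfy the model

$$y_i = \mathbf{x}_i'\boldsymbol\beta + u_i \quad (i = 0, \ldots, n), \tag{5.114}$$

where $u_i$ and $\mathbf{x}_i$ are independent, and

$$E\psi\left(\frac{u_i}{\sigma}\right) = 0. \tag{5.115}$$

Call $C_0 = \{j : \beta_j \neq 0\}$ the set of variables that actually have some predictive power.
Given $C \subseteq \{1, \ldots, p\}$ let

$$\boldsymbol\beta_C = (\beta_j, j \in C), \quad \mathbf{x}_{iC} = (x_{ij} : j \in C), \quad i = 0, \ldots, n.$$

Put $q = \#(C)$ and call $\hat{\boldsymbol\beta}_C \in R^q$ the estimator based on $\{(\mathbf{x}_{iC}, y_i), i = 1, \ldots, n\}$.
Then the residuals are $r_i = y_i - \hat{\boldsymbol\beta}_C'\mathbf{x}_{iC}$ for $i = 1, \ldots, n$.
Assume that $C \supseteq C_0$. Then $\mathbf{x}_i'\boldsymbol\beta = \mathbf{x}_{iC}'\boldsymbol\beta_C$ and the model (5.114) can be
rewritten as

$$y_i = \mathbf{x}_{iC}'\boldsymbol\beta_C + u_i, \quad i = 0, \ldots, n. \tag{5.116}$$

Put $\boldsymbol\Delta = \hat{\boldsymbol\beta}_C - \boldsymbol\beta_C$. A second-order Taylor expansion yields

$$\rho\left(\frac{y_0 - \hat{\boldsymbol\beta}_C'\mathbf{x}_{0C}}{\sigma}\right) = \rho\left(\frac{u_0 - \mathbf{x}_{0C}'\boldsymbol\Delta}{\sigma}\right) \approx \rho\left(\frac{u_0}{\sigma}\right) - \psi\left(\frac{u_0}{\sigma}\right)\frac{\mathbf{x}_{0C}'\boldsymbol\Delta}{\sigma} + \frac{1}{2}\psi'\left(\frac{u_0}{\sigma}\right)\left(\frac{\mathbf{x}_{0C}'\boldsymbol\Delta}{\sigma}\right)^2. \tag{5.117}$$

The independence of $u_0$ and $\hat{\boldsymbol\beta}_C$ and (5.115) yield

$$E\psi\left(\frac{u_0}{\sigma}\right)\mathbf{x}_{0C}'\boldsymbol\Delta = E\psi\left(\frac{u_0}{\sigma}\right)E(\mathbf{x}_{0C}'\boldsymbol\Delta) = 0. \tag{5.118}$$

According to (5.14), we have, for large $n$,

$$\mathcal{D}(\sqrt{n}\boldsymbol\Delta) \approx N\left(\mathbf{0}, \sigma^2\frac{A}{B^2}\mathbf{V}^{-1}\right),$$

where

$$A = E\psi^2\left(\frac{u}{\sigma}\right), \quad B = E\psi'\left(\frac{u}{\sigma}\right), \quad \mathbf{V} = E(\mathbf{x}_{0C}\mathbf{x}_{0C}').$$

Since $u_0$, $\boldsymbol\Delta$ and $\mathbf{x}_0$ are independent we have

$$E\psi'\left(\frac{u_0}{\sigma}\right)\left(\frac{\boldsymbol\Delta'\mathbf{x}_{0C}}{\sigma}\right)^2 = E\psi'\left(\frac{u_0}{\sigma}\right)E\left(\frac{\boldsymbol\Delta'\mathbf{x}_{0C}}{\sigma}\right)^2 \approx \frac{A}{nB}\,E\mathbf{x}_{0C}'\mathbf{V}^{-1}\mathbf{x}_{0C}. \tag{5.119}$$

Let $\mathbf{U}$ be any matrix such that $\mathbf{V} = \mathbf{U}\mathbf{U}'$, and hence such that

$$E(\mathbf{U}^{-1}\mathbf{x}_{0C})(\mathbf{U}^{-1}\mathbf{x}_{0C})' = \mathbf{I}_q,$$

where $\mathbf{I}_q$ is the $q \times q$ identity matrix. Then

$$E\mathbf{x}_{0C}'\mathbf{V}^{-1}\mathbf{x}_{0C} = E\|\mathbf{U}^{-1}\mathbf{x}_{0C}\|^2 = \mathrm{trace}(\mathbf{I}_q) = q, \tag{5.120}$$

and hence (5.117), (5.118) and (5.119) yield

$$\mathrm{RFPE}(C) \approx E\rho\left(\frac{u_0}{\sigma}\right) + \frac{q}{2n}\frac{A}{B}. \tag{5.121}$$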
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To estimate RFPE(C) using (5.121) we need to estimate Ep(ug/o). A second- 


ry [ux A 
iy (3 =) (5.122) 


n 


order Taylor expansion yields 


1 n rn 
7 (a) 


~) 
n 


QR 


The estimator B¢ satisfies the equation 


Yv(¥)xe=0 (5.123) 


and a first-order Taylor expansion of (5.123) yields 


n 

Pe U; , U; , A 

~ >, W\A)Xic 7a WIAs (Xic \ics 
=I o o oO 


and hence 
n 


- U; ! i 
Dv (2) xc * yw (2) oe A)xic; 


(5.124) 


LS o(2)=2¥ (Lhe Dwi (B) oar? 


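Since

$$\frac{1}{n}\sum_{i=1}^{n}\psi'\!\left(\frac{u_i}{\widehat{\sigma}}\right)\mathbf{x}_{iC}\mathbf{x}_{iC}'
\to_p \mathrm{E}\,\psi'\!\left(\frac{u_0}{\sigma}\right)\mathrm{E}(\mathbf{x}_{0C}\mathbf{x}_{0C}') = B\mathbf{V},$$

we obtain, using the asymptotic distribution of $\boldsymbol{\Delta}$ and (5.120),

$$\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{r_i}{\widehat{\sigma}}\right)
\approx \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{u_i}{\widehat{\sigma}}\right) - \frac{B}{2\sigma^{2}}\boldsymbol{\Delta}'\mathbf{V}\boldsymbol{\Delta}
\approx \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{u_i}{\widehat{\sigma}}\right) - \frac{A}{2Bn}\,q.$$

Hence by the law of large numbers and the consistency of $\widehat{\sigma}$,

$$\mathrm{E}\,\rho\!\left(\frac{u_0}{\sigma}\right)
\approx \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{u_i}{\widehat{\sigma}}\right)
\approx \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{r_i}{\widehat{\sigma}}\right) + \frac{Aq}{2Bn}, \tag{5.125}$$

and finally, inserting (5.125) in (5.121) yields

$$\mathrm{RFPE}(C) \approx \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{r_i}{\widehat{\sigma}}\right) + \frac{q}{n}\frac{A}{B}
\approx \frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{r_i}{\widehat{\sigma}}\right) + \frac{q}{n}\frac{\widehat{A}}{\widehat{B}} = \mathrm{RFPE}^{*}(C).$$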
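As a rough illustration of how a criterion of the form RFPE*(C) could be evaluated from the residuals of a fitted submodel, the following Python sketch may help. It assumes a bisquare ρ with an arbitrary illustrative tuning constant `k`, and it takes the plug-in quantities Â and B̂ to be the sample averages of ψ² and ψ' at the scaled residuals. These plug-in forms and all function names are assumptions of the sketch, not the definitions attached to (5.39).

```python
def rho(t, k=3.44):
    """Bisquare rho, scaled so that its maximum is 1 (illustrative choice)."""
    u = (t / k) ** 2
    return 1.0 - (1.0 - u) ** 3 if u <= 1.0 else 1.0


def psi(t, k=3.44):
    """Derivative of rho."""
    u = (t / k) ** 2
    return (6.0 * t / k ** 2) * (1.0 - u) ** 2 if u <= 1.0 else 0.0


def psi_prime(t, k=3.44):
    """Derivative of psi."""
    u = (t / k) ** 2
    return (6.0 / k ** 2) * (1.0 - u) * (1.0 - 5.0 * u) if u <= 1.0 else 0.0


def rfpe_star(residuals, sigma, q, k=3.44):
    """Average rho of the scaled residuals plus the penalty (q / n) * A_hat / B_hat."""
    n = len(residuals)
    scaled = [r / sigma for r in residuals]
    mean_rho = sum(rho(t, k) for t in scaled) / n
    a_hat = sum(psi(t, k) ** 2 for t in scaled) / n   # assumed plug-in for A
    b_hat = sum(psi_prime(t, k) for t in scaled) / n  # assumed plug-in for B
    return mean_rho + (q / n) * a_hat / b_hat
```

With the residuals and the scale held fixed, the penalty term grows linearly in q, so a larger submodel is preferred only if it lowers the average ρ of its residuals by more than the added penalty.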


5.14 Recommendations and software


For linear regression with random predictors or fixed predictors with high leverage
points, we recommend the MM-estimator with optimal ρ using the Peña-Yohai estimator
as initial value (Section 5.9.4), implemented in lmrobdetMM (RobStatTM);
and the DCML estimator (Section 5.9.3.2) with optimal ρ, implemented in
lmrobdetDCML (RobStatTM), which uses the output from lmrobdetMM to boost its
efficiency.

To compute regularized estimators for the linear model we recommend the function
pense (pense), which computes an MM version of the elastic net estimator. As
particular cases, it can compute the MM ridge and MM lasso.

For model selection we recommend step.lmrobdetMM (RobStatTM),
which uses the RFPE criterion (Section 5.6.2), and is based on the output from
lmrobdetMM.

Other possible options to compute robust estimators for linear regression are:


• initPY (pyinit) computes the Peña-Yohai estimator defined in Section 5.7.4,
which is used as initial estimator of lmrobdetMM.

• lmrob (robustbase) and lmRob (robust) are general programs that compute
MM- and other robust estimators. Both use as initial value an S-estimator
(Section 5.4.1). The first employs subsampling (Section 5.7.2) while the second
allows the user to choose between subsampling and Peña-Yohai.


5.15 Problems 


5.1. Show that S-estimators are regression, affine and scale equivariant. 


5.2. The stack loss dataset (Brownlee, 1965, p. 454) given in Table 5.12 consists of
observations from 21 days' operation of a plant for the oxidation of ammonia, a stage
in the production of nitric acid. The predictors X1, X2, X3 are respectively the
air flow, the cooling water inlet temperature, and the acid concentration, and
the response Y is the stack loss. Fit a linear model to these data using the LS
estimator, the MM-estimators with efficiencies 0.95 and 0.80, and the DCML
estimator, and compare the results. Plot the residuals against the day. Is there
a pattern?


Table 5.12 Stack loss data

Day   X1   X2   X3     Y
 1    80   27   58.9   4.2
 2    80   27   58.8   3.7
 3    75   25   59.0   3.7
 4    62   24   58.7   2.8
 5    62   22   58.7   1.8
 6    62   23   58.7   1.8
 7    62   24   59.3   1.9
 8    62   24   59.3   2.0
 9    58   23   58.7   1.5
10    58   18   58.0   1.4
11    58   18   58.9   1.4
12    58   17   58.8   1.3
13    58   18   58.2   1.1
14    58   19   59.3   1.2
15    50   18   58.9   0.8
16    50   18   58.6   0.7
17    50   19   57.2   0.8
18    50   19   57.9   0.8
19    50   20   58.0   0.9
20    56   20   58.2   1.5
21    70   20   59.1   1.5


5.3. The dataset alcohol (Romanelli et al., 2001) gives, for 44 aliphatic alcohols, the
logarithm of their solubility together with six physicochemical characteristics.
The interest is in predicting the solubility. Compare the results of using the
LS, MM and DCML estimators to fit the log-solubility as a function of the
characteristics.


5.4. The dataset waste (Chatterjee and Hadi, 1988) contains, for 40 regions, the
solid waste and five variables on land use. Fit a linear model to these data
using the LS, $L_1$, MM and DCML estimators. Draw the respective Q-Q plots
of residuals, and the plots of residuals against fitted values, and compare the
estimators and the plots.


5.5. Show that the “median of slopes” estimator (5.93) is a GM-estimator (5.85).


5.6. For the “median of slopes” estimator and the model $y_i = \beta x_i + u_i$, calculate the
following, assuming that $P(x = 0) = 0$:

(a) the asymptotic BP

(b) the influence function and the gross-error sensitivity

(c) the maximum asymptotic bias (hint: use (3.68)).


5.7. Show that when using the shortcut (5.44), the number of times that the M-scale
is computed has expectation $\sum_{i=1}^{N}(1/i) \le 1 + \ln N$, where $N$ is the number of
subsamples.


5.8. The minimum α-quantile regression estimator is defined for $\alpha \in (0, 1)$ as the
value of $\boldsymbol{\beta}$ minimizing the α-quantile of $|y - \mathbf{x}'\boldsymbol{\beta}|$. Show that this estimator is
an S-estimator for the scale given by $\rho(u) = \mathrm{I}(|u| > 1)$ and $\delta = 1 - \alpha$. Find
its asymptotic breakdown point.


5.9. For each $\boldsymbol{\beta}$ let $c(\boldsymbol{\beta})$ be the minimum $c$ such that

$$\#\{i : \boldsymbol{\beta}'\mathbf{x}_i - c \le y_i \le \boldsymbol{\beta}'\mathbf{x}_i + c\} \ge n/2.$$

Show that the LMS estimator minimizes $c(\boldsymbol{\beta})$.


5.10. Let $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ be a regression dataset, and $\widehat{\boldsymbol{\beta}}$ an S-estimator with
finite BP equal to $\varepsilon^*$. Let $D \subset \{1, \ldots, n\}$ with $\#(D) < n\varepsilon^*$. Show that there exists
$K$ such that:

(a) $\widehat{\boldsymbol{\beta}}$ as a function of the $y_i$ is constant if the $y_i$ with $i \notin D$ remain fixed and
those with $i \in D$ are changed in any way such that $|y_i| \ge K$.

(b) there exists $\widehat{\sigma}$ depending only on $D$ such that $\widehat{\boldsymbol{\beta}}$ verifies

$$\sum_{i \notin D}\rho\!\left(\frac{r_i(\widehat{\boldsymbol{\beta}})}{\widehat{\sigma}}\right) = \min.$$

Then:

(c) Discuss why property (a) does not mean that the value of the estimator
is the same as if we omit the points $(\mathbf{x}_i, y_i)$ with $i \in D$.

(d) Show that properties (a)-(c) also hold for MM-estimators.


5.11. Show that the adaptive MM-lasso can be computed by transforming the
regressors as $x_{ij}^{*} = |\widehat{\beta}_j|\,x_{ij}$.


6 


Multivariate Analysis 


6.1 Introduction 


Multivariate analysis deals with situations in which several variables are measured 
on each experimental unit. In most cases of interest it is known or assumed that 
some form of relationship exists among the variables, and hence that considering 
each of them separately would entail a loss of information. Some possible goals of 
the analysis are: 


• reduction of dimensionality (principal components, factor analysis, canonical
correlation);

• identification (discriminant analysis);

• explanatory models (multivariate linear model).


The reader is referred to Seber (1984) and Johnson and Wichern (1998) for further 
details. 


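A p-variate observation is now a vector $\mathbf{x} = (x_1, \ldots, x_p)' \in R^p$, and a distribution
$F$ now means a distribution on $R^p$. In the classical approach, location of a p-variate
random variable $\mathbf{x}$ is described by the expectation
$\boldsymbol{\mu} = \mathrm{E}\mathbf{x} = (\mathrm{E}x_1, \ldots, \mathrm{E}x_p)'$, and scatter is described by the covariance matrix

$$\mathrm{Var}(\mathbf{x}) = \mathrm{E}((\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})').$$

It is well known that $\mathrm{Var}(\mathbf{x})$ is symmetric and positive semidefinite, and that for each
constant vector $\mathbf{a}$ and matrix $\mathbf{A}$

$$\mathrm{E}(\mathbf{A}\mathbf{x} + \mathbf{a}) = \mathbf{A}\,\mathrm{E}\mathbf{x} + \mathbf{a}, \quad
\mathrm{Var}(\mathbf{A}\mathbf{x} + \mathbf{a}) = \mathbf{A}\,\mathrm{Var}(\mathbf{x})\,\mathbf{A}'. \tag{6.1}$$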


Robust Statistics: Theory and Methods (with R), Second Edition.

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matías Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd.
Companion website: www.wiley.com/go/maronna/robust

Classical multivariate methods of estimation are based on the assumption of an
i.i.d. sample of observations $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ with each $\mathbf{x}_i$ having a p-variate normal
$\mathrm{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution with density

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\sqrt{|\boldsymbol{\Sigma}|}}
\exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right), \tag{6.2}$$

where $\boldsymbol{\Sigma} = \mathrm{Var}(\mathbf{x})$ and $|\boldsymbol{\Sigma}|$ stands for the determinant of $\boldsymbol{\Sigma}$. The contours of constant
density are the elliptical surfaces

$$\{\mathbf{z} : (\mathbf{z} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{z} - \boldsymbol{\mu}) = c\}.$$


Assuming x is multivariate normal implies that for any constant vector a, all linear 
combinations a’x are normally distributed. It also implies that since the conditional 
expectation of one coordinate with respect to any group of coordinates is a linear 
function of the latter, the type of dependence among variables is linear. Thus methods 
based on multivariate normality will yield information only about linear relationships 
among coordinates. As in the univariate case, the main reason for assuming normality 
is simplicity. 

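It is known that under the normal distribution (6.2), the MLEs of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for a
sample $X$ are respectively the sample mean and sample covariance matrix

$$\bar{\mathbf{x}} = \mathrm{ave}(X) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i, \quad
\mathrm{Var}(X) = \mathrm{ave}\{(X - \bar{\mathbf{x}})(X - \bar{\mathbf{x}})'\}.$$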


The sample mean and sample covariance matrix share the behavior of the distri- 
bution mean and covariance matrix under affine transformations, namely (6.1), for 
each vector a and matrix A 


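$$\mathrm{ave}(\mathbf{A}X + \mathbf{a}) = \mathbf{A}\,\mathrm{ave}(X) + \mathbf{a}, \quad
\mathrm{Var}(\mathbf{A}X + \mathbf{a}) = \mathbf{A}\,\mathrm{Var}(X)\,\mathbf{A}',$$

where $\mathbf{A}X + \mathbf{a}$ is the dataset $\{\mathbf{A}\mathbf{x}_i + \mathbf{a},\ i = 1, \ldots, n\}$. This property is known as the
affine equivariance of the sample mean and covariances.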
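The affine equivariance of the sample mean and covariance matrix can be checked numerically. The sketch below is only an illustration on a small made-up bivariate dataset. The data, the matrix `A`, the vector `a` and all helper names are hypothetical, and the covariance uses divisor n to match the "ave" notation above.

```python
def mean_vec(xs):
    """Coordinate-wise average of a list of p-vectors."""
    n = len(xs)
    return [sum(x[j] for x in xs) / n for j in range(len(xs[0]))]


def cov_mat(xs):
    """Sample covariance with divisor n, i.e. ave{(x - xbar)(x - xbar)'}."""
    n, p = len(xs), len(xs[0])
    m = mean_vec(xs)
    return [[sum((x[j] - m[j]) * (x[k] - m[k]) for x in xs) / n
             for k in range(p)] for j in range(p)]


def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def affine_map(xs, A, a):
    """Return the transformed dataset {A x_i + a}."""
    return [[s + t for s, t in zip(mat_vec(A, x), a)] for x in xs]


X = [[1.0, 2.0], [3.0, 1.0], [4.0, 6.0], [0.0, 5.0]]  # made-up data
A = [[2.0, 1.0], [0.0, 3.0]]
a = [1.0, -1.0]

Y = affine_map(X, A, a)
lhs_mean = mean_vec(Y)
rhs_mean = [s + t for s, t in zip(mat_vec(A, mean_vec(X)), a)]
lhs_cov = cov_mat(Y)
rhs_cov = mat_mul(mat_mul(A, cov_mat(X)), transpose(A))
```

Up to rounding error, `lhs_mean` agrees with `rhs_mean` and `lhs_cov` with `rhs_cov`, which is the pair of identities displayed above.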

Just as in the univariate case, a few atypical observations may completely alter 
the sample means and/or covariances. Worse still, a multivariate outlier need not be 
an outlier in any of the coordinates considered separately. 


Example 6.1 The dataset in Seber (1984; Table 9.12) contains biochemical mea- 
surements on 12 men with similar weights. The data are plotted in Figure 6.1. The 
tables and figures for this example are obtained with script biochem.R. 


We see in Figure 6.1 that observation 3, which has the lowest phosphate value, stands 
out clearly from the rest. However, Figure 6.2, which shows the normal QQ plot of 
phosphate, does not reveal any atypical value, and the same occurs in the QQ plot 
of chloride (not shown). Thus the atypical character of observation 3 is visible only 
when considering both variables simultaneously.

Figure 6.1 Biochemical data (axes: phosphate, chloride)

Figure 6.2 Normal QQ plot for phosphate (horizontal axis: quantiles of standard normal)


Table 6.1 below shows that omitting this observation has no major effect on means 
or variances, but the correlation almost doubles in magnitude; that is, the influence 
of the outlier has been to decrease the correlation by a factor of two relative to the 
situation without the outlier. 




Table 6.1 The effect of omitting a bivariate outlier

                    Means           Vars.          Correl.
Complete data       1.79   6.01     0.26   3.66    -0.49
Without obs. 3      1.87   6.16     0.20   3.73    -0.80


Here we have an example of an observation that is not a one-dimensional outlier in 
either coordinate but strongly affects the results of the analysis. This example shows 
the need for robust substitutes for the mean vector and covariance matrix, and this 
will be the main theme of this chapter. The substitutes are generally referred to as 
multivariate location vectors and scatter matrices. The latter are also called robust 
covariance matrices in the literature. 

Some methods in multivariate analysis make no use of means or covariances; 
examples include Breiman et al.’s (1984) nonparametric classification and regression
trees (CART) methods. To some extent, such (nonequivariant) methods have a certain 
built-in robustness. But if we want to retain the simplicity of the normal distribution as 
the “nominal” model, with corresponding linear relationships, elliptical distributional 
shapes and affine equivariance for the bulk of the data, then the appropriate approach 
is to consider slight or moderate departures from normality. 

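Let $(\widehat{\boldsymbol{\mu}}(X), \widehat{\boldsymbol{\Sigma}}(X))$ be location and scatter estimators corresponding to a sample
$X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. Then the estimators are affine equivariant if

$$\widehat{\boldsymbol{\mu}}(\mathbf{A}X + \mathbf{b}) = \mathbf{A}\widehat{\boldsymbol{\mu}}(X) + \mathbf{b}, \quad
\widehat{\boldsymbol{\Sigma}}(\mathbf{A}X + \mathbf{b}) = \mathbf{A}\widehat{\boldsymbol{\Sigma}}(X)\mathbf{A}'. \tag{6.3}$$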


Affine equivariance is a desirable property of an estimator. The reasons are given 
in Section 6.17.1. This is, however, not a mandatory property, and may in some cases 
be sacrificed for other properties such as computational speed; an instance of this 
trade-off is considered in Section 6.9.1. 

As in the univariate case, one may consider outlier detection methods. The
squared Mahalanobis distance between the vectors $\mathbf{x}$ and $\boldsymbol{\mu}$ with respect to the
matrix $\boldsymbol{\Sigma}$ is defined as

$$d(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}). \tag{6.4}$$

For simplicity, $d$ will sometimes be referred to as “distance”, although it should
be kept in mind that it is actually a squared distance. Then the multivariate analogue
of $t_i^2$, where $t_i = (x_i - \bar{x})/s$ is the univariate outlyingness measure in (1.3),
is $D_i = d(\mathbf{x}_i, \bar{\mathbf{x}}, \mathbf{C})$, with $\mathbf{C} = \mathrm{Var}(X)$. When $p = 1$ we have $D_i = t_i^2\, n/(n - 1)$. It is
known (Seber, 1984) that if $\mathbf{x} \sim \mathrm{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $d(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim \chi_p^2$. Thus, assuming the
estimators $\bar{\mathbf{x}}$ and $\mathbf{C}$ are close to their true values, we may examine the QQ plot of $D_i$
against the quantiles of a $\chi_p^2$ distribution and delete observations for which $D_i$ is “too
high”. This approach may be effective when there is a single outlier but, as in the
case of location, it can be ineffective when $n$ is small (recall Section 1.3) and, as in
regression, several outliers may mask one another.
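To make the classical distances $D_i$ concrete, here is a minimal Python sketch for the bivariate case, using the explicit inverse of a 2 x 2 matrix. The dataset is made up for illustration: its last point lies inside the range of each coordinate taken separately but away from the main trend. The function names are hypothetical.

```python
def sq_mahalanobis_2d(x, mu, S):
    """Squared Mahalanobis distance (6.4) for p = 2 and a symmetric matrix S."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return (S[1][1] * dx * dx - 2.0 * S[0][1] * dx * dy + S[0][0] * dy * dy) / det


def classical_distances(xs):
    """D_i = d(x_i, xbar, C), with C the sample covariance (divisor n)."""
    n = len(xs)
    mean = [sum(x[j] for x in xs) / n for j in range(2)]
    C = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in xs) / n
          for k in range(2)] for j in range(2)]
    return [sq_mahalanobis_2d(x, mean, C) for x in xs]


# Made-up data: six points close to a line plus one point off the trend.
data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2),
        (5.0, 4.8), (6.0, 6.1), (2.0, 5.0)]
D = classical_distances(data)
most_outlying = max(range(len(D)), key=D.__getitem__)  # index of the largest D_i
```

In this toy dataset the largest distance belongs to the last point, although neither of its coordinates is extreme. With several such points the classical distances can fail, as the text notes, because the outliers distort the mean and covariance themselves.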


Example 6.2 The following dataset is a part of one given by Hettich and Bay 
(1999). It contains, for each of 59 wines grown in the same region in Italy, the 
quantities of 13 constituents. The original purpose of the analysis (de Vel et al. 
1993) was to classify wines from different cultivars by means of these measurements. 
In this example we treat cultivar one. The tables and figures for this example are 
obtained with script wine.R. 


The upper row of Figure 6.3 shows the plots of the classical squared distances as a
function of observation number, and their QQ plot with respect to the $\chi_p^2$ distribution.
No clear outliers stand out.

The lower row shows the results of using a robust estimator (called the “MM”),
to be defined in Section 6.4.4. At least seven points stand out clearly. The failure of
the classical analysis in the upper row of Figure 6.3 shows that several outliers may
“mask” one another. These seven outliers have a strong influence on the results of the
analysis.

Figure 6.3 Wine example: Mahalanobis distances versus index number for classical
and robust estimators (left), and QQ plots of distances (right)

Simple robust estimators of multivariate location can be obtained by applying a 
robust univariate location estimator to each coordinate, but this lacks affine equivari- 
ance. For scatter, there are simple robust estimators of the covariance between two 
variables (pairwise covariances) that could be used to construct a robust covariance 
matrix (see Devlin et al., 1981; Huber and Ronchetti, 2009). 

Apart from not being equivariant, the resulting matrix may not be positive 
semidefinite. See, however, Section 6.9 for an approach that ensures positive 
definiteness and “approximate” equivariance. Nonequivariant procedures may also 
lack robustness when the data are very collinear (Section 6.7). In subsequent sections 
we shall discuss a number of equivariant location and scatter estimators. 

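Note that if the matrix $\widehat{\boldsymbol{\Sigma}}$ with elements $\sigma_{jk}$, $j, k = 1, \ldots, p$, is a “robust covariance
matrix”, then the matrix $\mathbf{R}$ with elements

$$r_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}\sigma_{kk}}} \tag{6.5}$$

is a robust analog of the correlation matrix.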
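The conversion (6.5) from a scatter matrix to the corresponding correlation matrix is a one-line computation. The sketch below uses a made-up 2 x 2 matrix; the function name is hypothetical.

```python
import math


def correlation_from_scatter(S):
    """Apply (6.5): r_jk = s_jk / sqrt(s_jj * s_kk)."""
    p = len(S)
    return [[S[j][k] / math.sqrt(S[j][j] * S[k][k]) for k in range(p)]
            for j in range(p)]


S = [[4.0, 2.0], [2.0, 9.0]]  # made-up scatter matrix
R = correlation_from_scatter(S)
```

Multiplying `S` by a positive constant leaves `R` unchanged, so the result depends only on the shape of the scatter matrix.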


6.2 Breakdown and efficiency of multivariate 
estimators 


The concepts of breakdown point (BP) and efficiency will be necessary to understand 
the advantages and drawbacks of the different families of estimators discussed in this 
chapter. 


6.2.1 Breakdown point 


To define the BP of $(\widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$ based on the ideas in Section 3.2, we must establish the
meaning of “bounded, and also bounded away from the boundary of the parameter
space”. For the location vector, the parameter space is a finite-dimensional Euclidean
space, and so the statement means simply that $\widehat{\boldsymbol{\mu}}$ remains in a bounded set. However,
the scatter matrix has a more complex parameter space consisting of the set of symmetric
nonnegative definite matrices. Each such matrix is characterized by the matrix
of its eigenvalues. Thus “$\widehat{\boldsymbol{\Sigma}}$ bounded, and also bounded away from the boundary” is
equivalent to the eigenvalues being bounded away from zero and infinity.

From a more intuitive point of view, recall that if $\boldsymbol{\Sigma} = \mathrm{Var}(\mathbf{x})$ and $\mathbf{a}$ is a constant
vector, then $\mathrm{Var}(\mathbf{a}'\mathbf{x}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$. Hence if $\widehat{\boldsymbol{\Sigma}}$ is any robust scatter matrix, then $\sqrt{\mathbf{a}'\widehat{\boldsymbol{\Sigma}}\mathbf{a}}$
can be considered as a robust measure of scatter of the linear combination $\mathbf{a}'\mathbf{x}$. Let
$\lambda_1(\boldsymbol{\Sigma}) \ge \ldots \ge \lambda_p(\boldsymbol{\Sigma})$ be the eigenvalues of $\boldsymbol{\Sigma}$ in descending order, and $\mathbf{e}_1, \ldots, \mathbf{e}_p$ the
corresponding eigenvectors. It is a fact of linear algebra that for any symmetric matrix
$\boldsymbol{\Sigma}$, the minimum (resp. maximum) of $\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$ over $\|\mathbf{a}\| = 1$ is equal to $\lambda_p$ ($\lambda_1$) and this
minimum is attained for $\mathbf{a} = \mathbf{e}_p$ ($\mathbf{e}_1$). If we are interested in linear relationships among
the variables, then it is dangerous not only that the largest eigenvalue becomes too
large (“explosion”) but also that the smallest one becomes too small (“implosion”).
The first case is caused by outliers (observations far away from the bulk of the data),
the second by “inliers” (observations concentrated at some point or in general on a
region of lower dimensionality).

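For $0 \le m \le n$, call $\mathcal{Z}_m$ the set of “samples” $Z = \{\mathbf{z}_1, \ldots, \mathbf{z}_n\}$ such that
$\#\{\mathbf{z}_i = \mathbf{x}_i\} = n - m$, and call $\widehat{\boldsymbol{\mu}}(Z)$ and $\widehat{\boldsymbol{\Sigma}}(Z)$ the location vector and scatter matrix
estimators based on the sample $Z$. The finite breakdown point (FBP) of $(\widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$ is
defined as $\varepsilon^* = m^*/n$, where $m^*$ is the largest $m$ such that there are finite positive
$a, b, c$ such that

$$\|\widehat{\boldsymbol{\mu}}(Z)\| \le a \quad \text{and} \quad b \le \lambda_p(\widehat{\boldsymbol{\Sigma}}(Z)) \le \lambda_1(\widehat{\boldsymbol{\Sigma}}(Z)) \le c$$

for all $Z \in \mathcal{Z}_m$.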

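For theoretical purposes it may be simpler to work with the asymptotic BP. An
ε-contamination neighborhood $\mathcal{F}(F, \varepsilon)$ of a multivariate distribution $F$ is defined as
in (3.3). Applying Definition 3.2 and (3.20) to the present context we have that the
asymptotic BP of $(\widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$ is the largest $\varepsilon^* \in (0, 1)$ for which there exist finite positive
$a, b, c$ such that the following holds for all $\varepsilon < \varepsilon^*$ and all $G$:

$$\|\widehat{\boldsymbol{\mu}}_\infty((1 - \varepsilon)F + \varepsilon G)\| \le a,$$
$$b \le \lambda_p(\widehat{\boldsymbol{\Sigma}}_\infty((1 - \varepsilon)F + \varepsilon G)) \le \lambda_1(\widehat{\boldsymbol{\Sigma}}_\infty((1 - \varepsilon)F + \varepsilon G)) \le c.$$

In some cases we may restrict $G$ to range over point-mass distributions, and in that
case we use the terms “point-mass contamination neighborhoods” and “point-mass
breakdown point”.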


6.2.2 The multivariate exact fit property 


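A result analogous to that of Section 5.12.1 holds for multivariate location and scatter
estimation. Let the FBP of the affine equivariant estimator $(\widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$ be $\varepsilon^* = m^*/n$. Let
the dataset contain $q$ points on a hyperplane $H = \{\mathbf{x} : \boldsymbol{\beta}'\mathbf{x} = \gamma\}$ for some $\boldsymbol{\beta} \in R^p$ and
$\gamma \in R$. If $q \ge n - m^*$ then $\widehat{\boldsymbol{\mu}} \in H$, and $\widehat{\boldsymbol{\Sigma}}\boldsymbol{\beta} = \mathbf{0}$. The proof is given in Section 6.17.8.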


6.2.3 Efficiency 


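The asymptotic efficiency of $(\widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$ is defined as in (3.46). Call $(\widehat{\boldsymbol{\mu}}_n, \widehat{\boldsymbol{\Sigma}}_n)$ the estimators
for a sample of size $n$, and let $(\widehat{\boldsymbol{\mu}}_\infty, \widehat{\boldsymbol{\Sigma}}_\infty)$ be their asymptotic values. All estimators
considered in this chapter are consistent for the normal distribution in the following
sense: if $\mathbf{x}_i \sim \mathrm{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ then $\widehat{\boldsymbol{\mu}}_\infty = \boldsymbol{\mu}$ and $\widehat{\boldsymbol{\Sigma}}_\infty = c\boldsymbol{\Sigma}$ where $c$ is a constant (if $c = 1$ we
have the usual definition of consistency). This result will be seen to hold for the larger
family of elliptical distributions, to be defined later. Most estimators defined in this
chapter are also asymptotically normal:

$$\sqrt{n}(\widehat{\boldsymbol{\mu}}_n - \widehat{\boldsymbol{\mu}}_\infty) \to_d \mathrm{N}_p(\mathbf{0}, \mathbf{V}_\mu), \quad
\sqrt{n}\,\mathrm{vec}(\widehat{\boldsymbol{\Sigma}}_n - \widehat{\boldsymbol{\Sigma}}_\infty) \to_d \mathrm{N}_q(\mathbf{0}, \mathbf{V}_\Sigma),$$

where $q = p(p + 1)/2$ and, for a symmetric matrix $\boldsymbol{\Sigma}$, $\mathrm{vec}(\boldsymbol{\Sigma})$ is the vector containing
the $q$ elements of the upper triangle of $\boldsymbol{\Sigma}$. The matrices $\mathbf{V}_\mu$ and $\mathbf{V}_\Sigma$ are the asymptotic
covariance matrices of $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$. In general, the estimator can be defined in such a
way that $c = 1$ for a given model, say the multivariate normal.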

We consider the efficiency of $\widehat{\boldsymbol{\mu}}$ when the data have a $\mathrm{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution. In
Section 6.17.2 it is shown that an affine equivariant location estimator $\widehat{\boldsymbol{\mu}}$ has an
asymptotic covariance matrix of the form

$$\mathbf{V}_\mu = v\boldsymbol{\Sigma}, \tag{6.6}$$

where $v$ is a constant depending on the estimator. In the case of the normal distribution
MLE $\bar{\mathbf{x}}$ we have $v = 1$ and the matrix $\mathbf{V}_0$ in (3.46) is simply $\boldsymbol{\Sigma}$, which results in
$\mathbf{V}_\mu^{-1}\mathbf{V}_0 = v^{-1}\mathbf{I}$ and $\mathrm{eff}(\widehat{\boldsymbol{\mu}}) = 1/v$. Thus the normal distribution efficiency of an affine
equivariant location estimator is independent of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. With three exceptions,
considered in Sections 6.9.1, 6.12 and 6.13, the location estimators treated in this chapter
are affine equivariant.

The efficiency of $\widehat{\boldsymbol{\Sigma}}$ is much more complicated and will not be discussed here. It
has been dealt with by Tyler (1983) in the case of the class of M-estimators, which
are defined in the next section.


6.3 M-estimators 


Multivariate M-estimators will now be defined, as in Section 2.3, by generalizing
MLEs. Recall that in the univariate case it was possible to define separate robust
equivariant estimators of location and of scatter. This is more complicated to do
in the multivariate case and, if we want equivariant estimators, it is better to estimate
location and scatter simultaneously. We shall develop the multivariate analog
of simultaneous M-estimators (2.71)-(2.72). Recall that a multivariate normal density
has the form

$$f(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{|\boldsymbol{\Sigma}|}}\,h(d(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma})) \tag{6.7}$$

where $h(s) = c\exp(-s/2)$, with $c = (2\pi)^{-p/2}$ and $d(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$.
We note that the level sets of $f$ are ellipsoidal surfaces. In fact, for any choice of
positive $h$ such that $f$ integrates to one, the level sets of $f$ are ellipsoids, and so any
density of this form is called elliptically symmetric (henceforth “elliptical” for short).
In the special case where $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = c\mathbf{I}$, a density of the form (6.7) is called
spherically symmetric or radial (henceforth “spherical” for short). It is easy to
verify that the distribution $\mathcal{D}(\mathbf{x})$ is elliptical if and only if for some constant vector
$\mathbf{a}$ and matrix $\mathbf{A}$, $\mathcal{D}(\mathbf{A}(\mathbf{x} - \mathbf{a}))$ is spherical. An important example of a nonnormal
elliptical distribution is the p-variate Student distribution with $\nu$ degrees of freedom
($0 < \nu < \infty$), which will be denoted as $T_{p,\nu}$ and is obtained by the choice

$$h(s) = \frac{c}{(s + \nu)^{(p + \nu)/2}} \tag{6.8}$$

where $c$ is a constant. The case $\nu = 1$ is called the multivariate Cauchy density, and the
limiting case $\nu \to \infty$ yields the normal distribution. If the mean (resp. the covariance
matrix) of an elliptical distribution exists, then it is equal to $\boldsymbol{\mu}$ (resp. a multiple of $\boldsymbol{\Sigma}$)
(Problem 6.1). More details on elliptical distributions are given in Section 6.17.9.
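Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be an i.i.d. sample from an $f$ of the form (6.7) in which $h$ is assumed
everywhere positive. To calculate the MLE of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, note that the likelihood
function is

$$L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{|\boldsymbol{\Sigma}|^{n/2}}\prod_{i=1}^{n} h(d(\mathbf{x}_i, \boldsymbol{\mu}, \boldsymbol{\Sigma})),$$

and maximizing $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is equivalent to

$$-2\ln L(\widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}}) = n\ln|\widehat{\boldsymbol{\Sigma}}| + \sum_{i=1}^{n}\rho(d_i) = \min, \tag{6.9}$$

where

$$\rho(s) = -2\ln h(s) \quad \text{and} \quad d_i = d(\mathbf{x}_i, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}}). \tag{6.10}$$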


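Differentiating with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ yields the system of estimating equations (see
Section 6.17.3 for details):

$$\sum_{i=1}^{n} W(d_i)(\mathbf{x}_i - \widehat{\boldsymbol{\mu}}) = \mathbf{0} \tag{6.11}$$

$$\frac{1}{n}\sum_{i=1}^{n} W(d_i)(\mathbf{x}_i - \widehat{\boldsymbol{\mu}})(\mathbf{x}_i - \widehat{\boldsymbol{\mu}})' = \widehat{\boldsymbol{\Sigma}}, \tag{6.12}$$

with $W = \rho'$. For the normal distribution we have $W \equiv 1$, which yields the sample
mean and sample covariance matrix for $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$. For the multivariate Student
distribution (6.8) we have

$$W(d) = \frac{p + \nu}{d + \nu}. \tag{6.13}$$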
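In general, we define M-estimators as solutions of

$$\sum_{i=1}^{n} W_1(d_i)(\mathbf{x}_i - \widehat{\boldsymbol{\mu}}) = \mathbf{0} \tag{6.14}$$

$$\frac{1}{n}\sum_{i=1}^{n} W_2(d_i)(\mathbf{x}_i - \widehat{\boldsymbol{\mu}})(\mathbf{x}_i - \widehat{\boldsymbol{\mu}})' = \widehat{\boldsymbol{\Sigma}}, \tag{6.15}$$

where the functions $W_1$ and $W_2$ need not be equal. Note that by (6.15) we may
interpret $\widehat{\boldsymbol{\Sigma}}$ as a weighted covariance matrix, and by (6.14) we can express $\widehat{\boldsymbol{\mu}}$ as the
weighted mean

$$\widehat{\boldsymbol{\mu}} = \frac{\sum_{i=1}^{n} W_1(d_i)\,\mathbf{x}_i}{\sum_{i=1}^{n} W_1(d_i)}, \tag{6.16}$$

with weights depending on an outlyingness measure $d_i$. This is similar to (2.32) in
that with $w_i = W_1(d_i)$ we can express $\widehat{\boldsymbol{\mu}}$ as a weighted mean with data-dependent
weights.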
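The weighted-mean form (6.16) and the weighted covariance (6.15) suggest alternating between computing weights and recomputing the estimates. The Python sketch below does this for p = 2 with the Student weights (6.13) used for both $W_1$ and $W_2$. The iteration scheme, the starting values, the stopping rule, the choice ν = 3, the made-up data and all function names are assumptions of this illustration, not a description of any package's algorithm.

```python
def sq_dist(x, mu, S):
    """Squared Mahalanobis distance (6.4) for p = 2 and symmetric S."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    det = S[0][0] * S[1][1] - S[0][1] ** 2
    return (S[1][1] * dx * dx - 2.0 * S[0][1] * dx * dy + S[0][0] * dy * dy) / det


def weighted_mean(xs, w):
    """Weighted mean as in (6.16)."""
    sw = sum(w)
    return [sum(wi * x[j] for wi, x in zip(w, xs)) / sw for j in range(2)]


def weighted_scatter(xs, mu, w):
    """Left-hand side of (6.15): (1/n) sum_i w_i (x_i - mu)(x_i - mu)'."""
    n = len(xs)
    return [[sum(wi * (x[j] - mu[j]) * (x[k] - mu[k]) for wi, x in zip(w, xs)) / n
             for k in range(2)] for j in range(2)]


def student_weights(xs, mu, S, nu):
    """Weights W(d) = (p + nu) / (d + nu) from (6.13), with p = 2."""
    return [(2.0 + nu) / (sq_dist(x, mu, S) + nu) for x in xs]


def student_m_estimate(xs, nu=3.0, tol=1e-12, max_iter=1000):
    """Fixed-point iteration on (6.14)-(6.15) with W_1 = W_2 = W."""
    n = len(xs)
    mu = weighted_mean(xs, [1.0] * n)        # start from the sample mean
    S = weighted_scatter(xs, mu, [1.0] * n)  # and the sample covariance
    for _ in range(max_iter):
        w = student_weights(xs, mu, S, nu)
        new_mu = weighted_mean(xs, w)
        new_S = weighted_scatter(xs, new_mu, w)
        change = max(abs(a - b) for a, b in
                     zip(new_mu + new_S[0] + new_S[1], mu + S[0] + S[1]))
        mu, S = new_mu, new_S
        if change < tol:
            break
    return mu, S


# Made-up data: seven points around the origin and one gross outlier.
data = [(0.0, 0.0), (1.0, 0.5), (-1.0, -0.4), (0.5, 1.0),
        (-0.5, -1.0), (1.0, 1.2), (-1.0, -1.1), (10.0, -10.0)]
mu_hat, S_hat = student_m_estimate(data)
```

At convergence the returned pair reproduces itself when the weights are recomputed, which is exactly what (6.14)-(6.15) require. In this example the outlying point receives the smallest weight, so the location estimate stays closer to the bulk of the data than the sample mean does.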

Existence and uniqueness of solutions were treated by Maronna (1976) and more
generally by Tatsuoka and Tyler (2000). Uniqueness of the solutions of (6.14)-(6.15)
requires that $dW_2(d)$ be a nondecreasing function of $d$. To understand the reason for
this condition, note that an M-scale estimator of a univariate sample $\mathbf{z}$ may be written
as the solution of

$$\delta = \mathrm{ave}\!\left(\rho\!\left(\frac{\mathbf{z}}{\widehat{\sigma}}\right)\right)
= \mathrm{ave}\!\left(W\!\left(\frac{\mathbf{z}}{\widehat{\sigma}}\right)\frac{\mathbf{z}}{\widehat{\sigma}}\right),$$

where $W(t) = \rho(t)/t$. Thus the condition on the monotonicity of $dW_2(d)$ is the
multivariate version of the requirement that the ρ-function of a univariate M-scale be
monotone.

We shall call an M-estimator of location and scatter monotone if $dW_2(d)$ is
nondecreasing, and redescending otherwise. Monotone M-estimators are defined 
as solutions to the estimating equations (6.14)-(6.15), while redescending ones 
must be defined by the minimization of some objective function, as happens with 
S-estimators or CM-estimators, to be defined in Sections 6.4 and 6.16.2 respec- 
tively. Huber and Ronchetti (2009) consider a slightly more general definition of 
monotone M-estimators. For practical purposes, monotone estimators are essentially 
unique, in the sense that all solutions to the M-estimating equations are consistent 
estimators. 

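It is proved by Huber and Ronchetti (2009; Ch. 8) that if the $\mathbf{x}_i$ are i.i.d. with
distribution $F$, then under general assumptions when $n \to \infty$, monotone M-estimators,
defined as any solution $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$ of (6.14) and (6.15), converge in probability to the
solution $(\widehat{\boldsymbol{\mu}}_\infty, \widehat{\boldsymbol{\Sigma}}_\infty)$ of

$$\mathrm{E}\,W_1(d)(\mathbf{x} - \widehat{\boldsymbol{\mu}}_\infty) = \mathbf{0}, \tag{6.17}$$

$$\mathrm{E}\,W_2(d)(\mathbf{x} - \widehat{\boldsymbol{\mu}}_\infty)(\mathbf{x} - \widehat{\boldsymbol{\mu}}_\infty)' = \widehat{\boldsymbol{\Sigma}}_\infty, \tag{6.18}$$

where $d = d(\mathbf{x}, \widehat{\boldsymbol{\mu}}_\infty, \widehat{\boldsymbol{\Sigma}}_\infty)$. They also prove that $\sqrt{n}(\widehat{\boldsymbol{\mu}} - \widehat{\boldsymbol{\mu}}_\infty, \widehat{\boldsymbol{\Sigma}} - \widehat{\boldsymbol{\Sigma}}_\infty)$ tends to a
multivariate normal distribution. It is easy to show that M-estimators are affine equivariant
(Problem 6.2) and so if $\mathbf{x}$ has an elliptical distribution (6.7), the asymptotic covariance
matrix of $\widehat{\boldsymbol{\mu}}$ has the form (6.6) (see Sections 6.17.1 and 6.17.7).

6.3.1 Collinearity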


If the data are collinear, that is, all points lie on a hyperplane $H$, the sample
covariance matrix is singular and $\bar{\mathbf{x}} \in H$. It follows from (6.16) that since $\widehat{\boldsymbol{\mu}}$ is a
weighted mean of elements of $H$, it lies in $H$. Furthermore, (6.15) shows that $\widehat{\boldsymbol{\Sigma}}$
must be singular. In fact, if a sufficiently large proportion of the observations lie on
a hyperplane, $\widehat{\boldsymbol{\Sigma}}$ must be singular (Section 6.2.2). But in this case $\widehat{\boldsymbol{\Sigma}}^{-1}$, and hence the
$d_i$, do not exist and the M-estimator is not defined.

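To make the estimator well defined in all cases, it suffices to extend the definition
in (6.4) as follows. Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$ and $\mathbf{b}_j$ ($j = 1, \ldots, p$) be the eigenvalues
and eigenvectors of $\widehat{\boldsymbol{\Sigma}}$. For a given $\mathbf{x}$, let $z_j = \mathbf{b}_j'(\mathbf{x} - \widehat{\boldsymbol{\mu}})$. Since $\mathbf{b}_1, \ldots, \mathbf{b}_p$ are an
orthonormal basis, we have

$$\mathbf{x} - \widehat{\boldsymbol{\mu}} = \sum_{j=1}^{p} z_j\mathbf{b}_j.$$

Then, if $\widehat{\boldsymbol{\Sigma}}$ is not singular, we have (Problem 6.12):

$$d(\mathbf{x}, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}}) = \sum_{j=1}^{p}\frac{z_j^2}{\lambda_j}. \tag{6.19}$$

On the other hand, if $\widehat{\boldsymbol{\Sigma}}$ is singular, its smallest $q$ eigenvalues are zero and in this case
we define

$$d(\mathbf{x}, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}}) =
\begin{cases}
\sum_{j=1}^{p-q} z_j^2/\lambda_j & \text{if } z_{p-q+1} = \ldots = z_p = 0 \\
\infty & \text{else}
\end{cases} \tag{6.20}$$

which may be seen as the limit case of (6.19) when $\lambda_j \downarrow 0$ for $j > p - q$.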
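The extended distance (6.19)-(6.20) is easy to compute once the eigenvalues and eigenvectors of the scatter matrix are available. In the sketch below they are passed in directly. The numerical tolerance `tol`, used to decide when an eigenvalue or a coordinate counts as zero in floating point, is an assumption of the sketch, as is the function name.

```python
import math


def extended_sq_dist(x, mu, eigvals, eigvecs, tol=1e-12):
    """Distance (6.19), extended as in (6.20) when some eigenvalues vanish.

    eigvals[j] and eigvecs[j] are the j-th eigenvalue and the matching
    orthonormal eigenvector; eigenvalues <= tol are treated as zero.
    """
    diff = [xi - mi for xi, mi in zip(x, mu)]
    total = 0.0
    for lam, b in zip(eigvals, eigvecs):
        z = sum(bi * di for bi, di in zip(b, diff))  # z_j = b_j'(x - mu)
        if lam > tol:
            total += z * z / lam
        elif abs(z) > tol:
            return math.inf  # x lies off the subspace of positive eigenvalues
    return total


e1, e2 = [1.0, 0.0], [0.0, 1.0]
on_plane = extended_sq_dist([2.0, 0.0], [0.0, 0.0], [4.0, 0.0], [e1, e2])
off_plane = extended_sq_dist([2.0, 1.0], [0.0, 0.0], [4.0, 0.0], [e1, e2])
```

Here the scatter matrix is singular with its mass on the first axis: a point on that axis gets a finite distance, while a point off it gets an infinite one.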

Note that $d_i$ enters (6.14)-(6.15) only through the functions $W_1$ and $W_2$, which
tend to zero at infinity, so this extended definition simply excludes those points that
do not belong to the hyperplane spanned by the eigenvectors corresponding to the
positive eigenvalues of $\widehat{\boldsymbol{\Sigma}}$.


6.3.2 Size and shape 


If one scatter matrix is a scalar multiple of another, that is, $\boldsymbol{\Sigma}_2 = k\boldsymbol{\Sigma}_1$, we say that
they have the same shape, but different sizes. Several important features of the
distribution, such as correlations, principal components and linear discriminant functions,
depend only on shape.

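Let $\widehat{\boldsymbol{\mu}}_\infty$ and $\widehat{\boldsymbol{\Sigma}}_\infty$ be the asymptotic values of location and scatter estimators at an
elliptical distribution $F$ defined in (6.7). It is shown in Section 6.17.2 that in this case
$\widehat{\boldsymbol{\mu}}_\infty$ is equal to the center of symmetry $\boldsymbol{\mu}$, and $\widehat{\boldsymbol{\Sigma}}_\infty$ is a constant multiple of $\boldsymbol{\Sigma}$, with
the proportionality constant depending on $F$ and on the estimator. This situation is
similar to the scaling problem in (2.63) and at the end of Section 2.5. Consider in
particular an M-estimator at the distribution $F = \mathrm{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. By the equivariance of
the estimator, we may assume that $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$. Then $\widehat{\boldsymbol{\mu}}_\infty = \mathbf{0}$ and $\widehat{\boldsymbol{\Sigma}}_\infty = c\mathbf{I}$, and
hence $d(\mathbf{x}, \widehat{\boldsymbol{\mu}}_\infty, \widehat{\boldsymbol{\Sigma}}_\infty) = \|\mathbf{x}\|^2/c$. Taking the trace in (6.18) yields

$$\mathrm{E}\,W_2\!\left(\frac{\|\mathbf{x}\|^2}{c}\right)\|\mathbf{x}\|^2 = pc.$$

Since $\|\mathbf{x}\|^2$ has a $\chi_p^2$ distribution, we obtain a consistent estimator of the covariance
matrix $\boldsymbol{\Sigma}$ in the normal case by replacing $\widehat{\boldsymbol{\Sigma}}$ by $\widehat{\boldsymbol{\Sigma}}/c$, with $c$ defined as the solution of

$$\int_0^\infty W_2\!\left(\frac{z}{c}\right)\frac{z}{c}\,g(z)\,dz = p, \tag{6.21}$$

where $g$ is the density of the $\chi_p^2$ distribution.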
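Another approach to estimating the size of $\boldsymbol{\Sigma}$ is based on noting that if $\mathbf{x} \sim \mathrm{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
then $d(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim \chi_p^2$, and the fact that $\boldsymbol{\Sigma} = c\widehat{\boldsymbol{\Sigma}}_\infty$ implies

$$c\,d(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = d(\mathbf{x}, \boldsymbol{\mu}, \widehat{\boldsymbol{\Sigma}}_\infty).$$

Hence the empirical distribution of

$$\{d(\mathbf{x}_1, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}}), \ldots, d(\mathbf{x}_n, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})\}$$

will resemble that of $d(\mathbf{x}, \widehat{\boldsymbol{\mu}}_\infty, \widehat{\boldsymbol{\Sigma}}_\infty)$, which is $c\chi_p^2$, and so we may estimate $c$ robustly
with

$$\widehat{c} = \frac{\mathrm{Med}\{d(\mathbf{x}_1, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}}), \ldots, d(\mathbf{x}_n, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})\}}{\chi_p^2(0.5)} \tag{6.22}$$

where $\chi_p^2(\alpha)$ denotes the α-quantile of the $\chi_p^2$ distribution.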
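The size correction (6.22) only needs the median of the squared distances and the median of the chi-square distribution. The sketch below treats p = 2, where the chi-square distribution is exponential with mean 2 and its median is 2 ln 2. For other p the quantile would have to come from a statistical library or table. The distances are made up and the function name is hypothetical.

```python
import math
import statistics

CHI2_MEDIAN_P2 = 2.0 * math.log(2.0)  # median of the chi-square law with 2 d.f.


def size_factor(distances, chi2_median=CHI2_MEDIAN_P2):
    """Estimate c as in (6.22): median squared distance over the chi-square median."""
    return statistics.median(distances) / chi2_median


# Made-up squared distances, inflated by a factor of 3, with one huge value.
base = [0.1, 0.5, CHI2_MEDIAN_P2, 3.0, 50.0]
c_hat = size_factor([3.0 * d for d in base])
```

Because only the median enters, replacing the largest distance by an arbitrarily larger one leaves `c_hat` unchanged, which is the sense in which (6.22) estimates c robustly.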


6.3.3 Breakdown point


It is intuitively clear that robustness of the estimators requires that no term dominates
the sums in (6.14)-(6.15), and to achieve this we assume

$$W_1(d)\sqrt{d},\ W_2(d) \text{ and } W_2(d)\,d \text{ are bounded for } d \ge 0. \tag{6.23}$$

Let

$$K = \sup_d W_2(d)\,d. \tag{6.24}$$
d 


We first consider the asymptotic BP, which is easier to deal with. The “weak part”
of joint M-estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ is the estimator $\widehat{\boldsymbol{\Sigma}}$, for if we take $\boldsymbol{\Sigma}$ as known, then
it is not difficult to prove that the asymptotic BP of $\widehat{\boldsymbol{\mu}}$ is 1/2 (see Section 6.17.4.1).
On the other hand, in the case where $\boldsymbol{\mu}$ is known, the following result was obtained
by Maronna (1976). If the underlying distribution $F_0$ attributes zero mass to any
hyperplane, then the asymptotic BP of a monotone M-estimator of $\boldsymbol{\Sigma}$ with $W_2$
satisfying (6.23) is

$$\varepsilon^* = \min\!\left(\frac{1}{K},\ 1 - \frac{p}{K}\right). \tag{6.25}$$

See Section 6.17.4.2 for a simplified proof. The above expression has a maximum
value of $1/(p + 1)$, attained at $K = p + 1$, and hence

$$\varepsilon^* \le \frac{1}{p + 1}. \tag{6.26}$$

See also Tyler (1990).
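The bound (6.26) can be checked numerically by evaluating (6.25) over a grid of values of K. The following sketch does so for p = 5. The grid, the restriction to K > p and the function name are choices made only for this illustration.

```python
def asymptotic_bp(K, p):
    """Evaluate (6.25), min(1/K, 1 - p/K), for K > p."""
    if K <= p:
        raise ValueError("this sketch assumes K > p")
    return min(1.0 / K, 1.0 - p / K)


p = 5
grid = [p + 0.1 * j for j in range(1, 200)]  # K from p + 0.1 to p + 19.9
best_K = max(grid, key=lambda K: asymptotic_bp(K, p))
best_bp = asymptotic_bp(best_K, p)
```

On this grid the maximum is found at K = p + 1 = 6 with value 1/6. The first term of the minimum decreases in K and the second increases, so the two cross exactly at K = p + 1.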

Tyler (1987) proposed a monotone M-estimator with $W_2(d) = p/d$, which corresponds
to the multivariate t-distribution MLE weights (6.13) with degrees of freedom
$\nu \downarrow 0$. Tyler showed that the BP of this estimator is $\varepsilon^* = 1/p$, which is slightly larger
than the bound (6.26). This result is not a contradiction of (6.26), since $W_2$ is not
defined at zero and hence does not satisfy (6.23). Unfortunately this unboundedness
may make the estimator unstable.

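It is useful to understand the form of the breakdown under the assumptions (6.23).
Take $F = (1 - \varepsilon)F_0 + \varepsilon G$, where $G$ is any contaminating distribution. First let $G$ be
concentrated at $\mathbf{x}_0$. Then the term $1/K$ in (6.25) is obtained by letting $\mathbf{x}_0 \to \infty$, and
the term $1 - p/K$ is obtained by letting $\mathbf{x}_0 \to \boldsymbol{\mu}$. Now consider a general $G$. For the
joint estimation of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, Tyler shows that if $\varepsilon > \varepsilon^*$ and one lets $G$ tend to $\delta_{\mathbf{x}_0}$, then
$\widehat{\boldsymbol{\mu}} \to \mathbf{x}_0$ and $\lambda_p(\widehat{\boldsymbol{\Sigma}}) \to 0$; that is, inliers can make $\widehat{\boldsymbol{\Sigma}}$ nearly singular.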

The FBP is similar, but the details are more involved (Tyler, 1990). Define a
sample to be in general position if no hyperplane contains more than $p$ points.
Davies (1987) showed that the maximum FBP of any equivariant estimator for a
sample in general position is $m^*_{\max}/n$, with

$$m^*_{\max} = \left[\frac{n - p}{2}\right]. \tag{6.27}$$

It is therefore natural to search for estimators whose BP is nearer to this maximum
BP than that of monotone M-estimators.


6.4 Estimators based on a robust scale 


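Just as with the regression estimators of Section 5.4, where we aimed at making
the residuals “small”, we shall define multivariate estimators of location and scatter
that make the distances $d_i$ “small”. To this end, we look for $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$ minimizing
some measure of “largeness” of $d(\mathbf{x}, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$. It follows from (6.4) that this can be
trivially done by letting the smallest eigenvalue of $\widehat{\boldsymbol{\Sigma}}$ tend to infinity. To prevent
this, we impose the constraint $|\widehat{\boldsymbol{\Sigma}}| = 1$. Call $\mathcal{S}_p$ the set of symmetric positive definite
$p \times p$ matrices. For a dataset $X$, call $\mathbf{d}(X, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$ the vector with elements $d(\mathbf{x}_i, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})$,
$i = 1, \ldots, n$, and let $\widehat{\sigma}$ be a robust scale estimator. Then we define the estimators $\widehat{\boldsymbol{\mu}}$
and $\widehat{\boldsymbol{\Sigma}}$ by

$$\widehat{\sigma}(\mathbf{d}(X, \widehat{\boldsymbol{\mu}}, \widehat{\boldsymbol{\Sigma}})) = \min \quad \text{with } \widehat{\boldsymbol{\mu}} \in R^p,\ \widehat{\boldsymbol{\Sigma}} \in \mathcal{S}_p,\ |\widehat{\boldsymbol{\Sigma}}| = 1. \tag{6.28}$$

It is easy to show that the estimators defined by (6.28) are equivariant. An equivalent
formulation of the above goal is to minimize $|\widehat{\boldsymbol{\Sigma}}|$ subject to a bound on $\widehat{\sigma}$
(Problems 6.7-6.9).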


6.4.1 The minimum volume ellipsoid estimator 


The simplest case of (6.28) is to mimic the approach that results in the LMS in
Section 5.4, and let $\hat\sigma$ be the sample median. The resulting location and scatter matrix
estimator is called the minimum volume ellipsoid (MVE) estimator (Rousseeuw
1985). The name stems from the fact that among all ellipsoids $\{x : d(x, \mu, \Sigma) \le 1\}$
containing at least half of the data points, the one given by the MVE estimator has
minimum volume; that is, the minimum $|\Sigma|$. The consistency rate of the MVE is the
same slow rate as that of the LMS, namely only $n^{-1/3}$, and hence the MVE is very
inefficient (Davies, 1992).


6.4.2 S-estimators 


To overcome the inefficiency of the MVE we consider a more general class of estimators
called S-estimators (Davies, 1987), defined by (6.28), taking for $\hat\sigma$ an M-scale
estimator that satisfies

$$\frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{d_i}{\hat\sigma}\right) = \delta, \tag{6.29}$$

where $\rho$ is a smooth bounded $\rho$-function. The same reasoning as in (5.24) shows
that an S-estimator $(\hat\mu, \hat\Sigma)$ is an M-estimator, in the sense that for any $\tilde\mu, \tilde\Sigma$ with
$|\tilde\Sigma| = 1$ and $\hat\sigma = \hat\sigma(\mathbf{d}(X, \hat\mu, \hat\Sigma))$,

$$\sum_{i=1}^{n} \rho\left(\frac{d(x_i, \hat\mu, \hat\Sigma)}{\hat\sigma}\right) \le \sum_{i=1}^{n} \rho\left(\frac{d(x_i, \tilde\mu, \tilde\Sigma)}{\hat\sigma}\right). \tag{6.30}$$

If $\rho$ is differentiable, it can be shown (Section 6.17.5) that the solution to (6.28)
must satisfy estimating equations of the form (6.14)-(6.15), namely

$$\sum_{i=1}^{n} W\left(\frac{d_i}{\hat\sigma}\right)(x_i - \hat\mu) = 0, \tag{6.31}$$

$$\frac{1}{n}\sum_{i=1}^{n} W\left(\frac{d_i}{\hat\sigma}\right)(x_i - \hat\mu)(x_i - \hat\mu)' = c\hat\Sigma, \tag{6.32}$$

where

$$W = \rho' \quad\text{and}\quad \hat\sigma = \hat\sigma(d_1, \ldots, d_n), \tag{6.33}$$

and $c$ is a scalar such that $|\hat\Sigma| = 1$. Note, however, that if $\rho$ is bounded (as is the usual
case), $dW(d)$ cannot be monotone (Problem 6.5); in fact, for the estimators usually
employed, $W(d)$ vanishes for large $d$. Therefore, the estimator is not a monotone
M-estimator, and so the estimating equations yield only local minima of $\hat\sigma$.
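The choice $\rho(d) = d$ yields the average of the $d_i$ as scale estimator. In this case,
$W \equiv 1$ and hence

$$\hat\mu = \bar{x}, \qquad \hat\Sigma = \frac{C}{|C|^{1/p}}, \tag{6.34}$$

where $C$ is the sample covariance matrix. For this choice of scale estimator it follows
that

$$\sum_{i=1}^{n} (x_i - \bar{x})'\hat\Sigma^{-1}(x_i - \bar{x}) \le \sum_{i=1}^{n} (x_i - \nu)'V^{-1}(x_i - \nu) \tag{6.35}$$

for all $\nu$ and $V$ with $|V| = 1$.

It can be shown (Davies, 1987) that if $\rho$ is differentiable, then for S-estimators
the distribution of $\sqrt{n}(\hat\mu - \hat\mu_\infty, \hat\Sigma - \hat\Sigma_\infty)$ tends to a multivariate normal.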

Similarly to Section 5.4, it can be shown that the maximum FBP (6.27) is attained
for S-estimators by taking in (6.29):

$$n\delta = m^*_{\max} = \left[\frac{n-p}{2}\right].$$

We define the bisquare multivariate S-estimator as the one with scale given by
(6.29), with

$$\rho(t) = \min\{1, 1 - (1 - t)^3\}, \tag{6.36}$$

which has weight function

$$W(t) = 3(1 - t)^2 I(t \le 1). \tag{6.37}$$

The reason for this definition is that in the univariate case the bisquare scale estimator
(call it $\hat\eta$ for notational convenience) based on centered data $x_i$ with location $\hat\mu$
is the solution of

$$\frac{1}{n}\sum_{i=1}^{n} \rho_{\mathrm{bisq}}\left(\frac{x_i - \hat\mu}{\hat\eta}\right) = \delta, \tag{6.38}$$

where $\rho_{\mathrm{bisq}}(t) = \min\{1, 1 - (1 - t^2)^3\}$. Since $\rho_{\mathrm{bisq}}(t) = \rho(t^2)$ for the $\rho$ defined in
(6.36), it follows that (6.38) is equivalent to

$$\frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{(x_i - \hat\mu)^2}{\hat\sigma}\right) = \delta$$

with $\hat\sigma = \hat\eta^2$. Now $d(x, \hat\mu, \hat\Sigma)$ is the normalized squared distance between $x$ and $\hat\mu$,
which explains the use of $\rho$.
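To make the M-scale (6.29) with the bisquare $\rho$ of (6.36) concrete, here is a minimal sketch that solves the scale equation by bisection for a given list of squared distances. The bisection solver, the function names and the example distances are our own illustrative choices; they are not the algorithm used in the book's software.

```python
def rho_bisquare(t):
    """Bisquare rho of (6.36); t is a nonnegative (scaled) squared distance."""
    return 1.0 if t >= 1.0 else 1.0 - (1.0 - t) ** 3


def weight_bisquare(t):
    """Weight function (6.37), the derivative of rho_bisquare."""
    return 3.0 * (1.0 - t) ** 2 if t <= 1.0 else 0.0


def m_scale(d, delta=0.5, tol=1e-12):
    """Solve (1/n) * sum(rho(d_i / sigma)) = delta for sigma by bisection."""
    n = len(d)
    if sum(1 for x in d if x > 0) <= delta * n:
        raise ValueError("too many zero distances for this delta")

    def mean_rho(sigma):
        return sum(rho_bisquare(x / sigma) for x in d) / n

    # mean_rho is nonincreasing in sigma, so we bracket the root and bisect.
    hi = max(d)
    while mean_rho(hi) > delta:
        hi *= 2.0
    lo = hi
    while mean_rho(lo) <= delta:
        lo /= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    d = [0.3, 1.1, 0.7, 2.5, 0.9, 1.8, 40.0]  # made-up squared distances
    s = m_scale(d)
    print("scale:", s)
    print("check:", sum(rho_bisquare(x / s) for x in d) / len(d))
```

Because $\rho$ is bounded, moving the largest distance further out does not change the scale once that distance exceeds $\hat\sigma$, which is the source of the estimator's robustness.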
6.4.3 The MCD estimator


Another possibility is to use a trimmed scale for $\hat\sigma$ instead of an M-scale, as was
done to define the LTS estimator in Section 5.4.2. Let $d_{(1)} \le \ldots \le d_{(n)}$ be the ordered
values of the squared distances $d_i = d(x_i, \mu, \Sigma)$, and for $1 \le h < n$ define the trimmed
scale of the squared distances as

$$\hat\sigma = \sum_{i=1}^{h} d_{(i)}. \tag{6.39}$$

An estimator $(\hat\mu, \hat\Sigma)$ defined by (6.28) with this trimmed scale is called a minimum
covariance determinant (MCD) estimator (Rousseeuw 1985). The reason for the
name is the following: for each ellipsoid $\{x : d(x, t, V) \le 1\}$ containing at least $h$
data points, compute the covariance matrix $C$ of the data points in the ellipsoid. If
$(\hat\mu, \hat\Sigma)$ is an MCD estimator, then the ellipsoid with $t = \hat\mu$ and $V$ equal to a scalar
multiple of $\hat\Sigma$ minimizes $|C|$.

As in the case of the LTS estimator in Section 5.4, the maximum BP of the MCD
estimator is attained by taking $h = n - m^*_{\max}$, with $m^*_{\max}$ as defined in (6.27).

Note that increasing $h$ increases the efficiency, but at the cost of decreasing the
BP. Paindaveine and Van Bever (2014) show that the asymptotic efficiency of MCD
is very low.


6.4.4 S-estimators for high dimension 


Consider the multivariate S-estimator with a bisquare $\rho$-function. Table 6.2 gives
the asymptotic efficiencies of the scatter matrix vector under normality for different
dimensions p and n = 10p (efficiency will be defined in (6.80)).


Table 6.2 Efficiencies of the S-estimator with bisquare weights for dimension p

p            2      5      10     20     30     40     50
Efficiency   0.427  0.793  0.930  0.976  0.984  0.990  0.992


It is seen that the efficiency is low for small p, but it approaches one for large
p. The same thing happens with the location vector. It is shown in Section 6.17.6
that this behavior holds for any S-estimator with a continuous weight function
$W = \rho'$. This may seem like good news. However, the proof shows that for large
p all observations, except those that are extremely far away from the bulk of the
data, have approximately the same weight, and hence the estimator is approximately
equal to the sample mean and sample covariance matrix. As a result, observations
outlying enough to be dangerous may also have nearly maximum weight, and so
the bias can be very large (bias is defined in Section 6.7). It will be seen later
that this increase in efficiency and decrease in robustness with large p does not occur
with the MVE.

Rocke (1996) pointed out the problem just described, and proposed that the
$\rho$-function change with dimension to prevent both the efficiency from increasing to
values arbitrarily close to one and, correspondingly, the bias becoming arbitrarily
large. He proposed a family of $\rho$-functions with the property that when $p \to \infty$ the
function $\rho$ approaches the step function $\rho(d) = I(d > 1)$. The latter corresponds to
the scale estimator $\hat\sigma = \mathrm{Med}(\mathbf{d})$ and so the limiting form of the estimator for large
dimensions is the MVE estimator.

Put, for brevity, $d = d(x, \hat\mu_\infty, \hat\Sigma_\infty)$. It is shown in Section 6.17.6 that if x is normal,
then for large p

$$\mathcal{D}\left(\frac{d}{\sigma}\right) \approx \mathcal{D}\left(\frac{z}{p}\right) \quad\text{with } z \sim \chi^2_p$$

and hence $d/\sigma$ is increasingly concentrated around one. For large p, the $\chi^2_p$ distribution
is approximately symmetric, with $\chi^2_p(0.5) \approx \mathrm{E}z = p$ and $\chi^2_p(1 - \alpha) - p \approx
p - \chi^2_p(\alpha)$. Figure 6.4 shows the densities of $z/p$ for p = 10 and 100, scaled to
facilitate the comparison. Note that when p increases, the density becomes more
concentrated around one.

Figure 6.4 Densities of z/p for p = 10 and 100

To have a sufficiently high (but not too high) efficiency, we should give a high
weight to the values of $d/\sigma$ near one and downweight the extreme ones. A simple
way to do this is to have $W(t) = 0$ for t outside the interval between the $\alpha$- and the
$(1 - \alpha)$-quantiles of $d/\sigma$ for some $\alpha \in (0, 1)$. We now define a smooth $\rho$-function such
that the resulting weight function $W(t)$ vanishes for $t \notin [1 - \gamma, 1 + \gamma]$, where $\gamma$ depends on $\alpha$. Let

$$\gamma = \min\left(\frac{\chi^2_p(1 - \alpha)}{p} - 1,\ 1\right), \tag{6.40}$$

where $\chi^2_p(\alpha)$ denotes the $\alpha$-quantile of $\chi^2_p$. Define

$$\rho(t) = \begin{cases} 0 & \text{for } 0 \le t \le 1 - \gamma \\ \dfrac{t-1}{4\gamma}\left[3 - \left(\dfrac{t-1}{\gamma}\right)^2\right] + \dfrac{1}{2} & \text{for } 1 - \gamma < t < 1 + \gamma \\ 1 & \text{for } t \ge 1 + \gamma \end{cases} \tag{6.41}$$

which has as derivative the weight function

$$W(t) = \frac{3}{4\gamma}\left[1 - \left(\frac{t-1}{\gamma}\right)^2\right] I(1 - \gamma \le t \le 1 + \gamma). \tag{6.42}$$


Figure 6.5 shows the $\rho$- and W-functions corresponding to $\alpha = 0.005$ for p = 10
and 100. To simplify viewing, the weight functions are scaled so that $\max_t W(t) = 1$.
When p increases, the interval on which W is positive shrinks, and $\rho$ tends to the step
function corresponding to the MVE.
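The functions (6.41) and (6.42) are simple enough to be sketched directly. In the code below $\gamma$ is passed as an argument rather than computed from (6.40), because the chi-squared quantile is not available in the Python standard library; the function names and the value of $\gamma$ used in the demonstration are illustrative assumptions.

```python
def rocke_rho(t, gamma):
    """rho of (6.41): 0 below 1 - gamma, 1 above 1 + gamma, a cubic in between."""
    if t <= 1.0 - gamma:
        return 0.0
    if t >= 1.0 + gamma:
        return 1.0
    u = (t - 1.0) / gamma
    return 0.25 * u * (3.0 - u * u) + 0.5


def rocke_weight(t, gamma):
    """Weight function (6.42), the derivative of rocke_rho with respect to t."""
    if t < 1.0 - gamma or t > 1.0 + gamma:
        return 0.0
    u = (t - 1.0) / gamma
    return 0.75 / gamma * (1.0 - u * u)


if __name__ == "__main__":
    gamma = 0.4  # illustrative value; in the text gamma comes from (6.40)
    for t in [0.5, 0.6, 0.8, 1.0, 1.2, 1.4, 1.5]:
        print(f"t={t:.1f}  rho={rocke_rho(t, gamma):.4f}  W={rocke_weight(t, gamma):.4f}")
```

A smaller $\gamma$ narrows the interval on which the weight is positive, which is the mechanism by which the estimator approaches the MVE as p grows.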


Figure 6.5 $\rho$- and W-functions of Rocke estimator with $\alpha = 0.005$ for p = 10 and 100

Figure 6.6 Weight functions of Rocke (with 0.9 efficiency) and bisquare estimators


The parameter a allows the user to control the efficiency (see more in 
Section 6.10). Figure 6.6 shows for p = 10 and 100 and n = 10p the weights of the 
Rocke estimator with finite-sample efficiency equal to 0.9, and those of the bisquare 
estimator, as a function of dp where d is the squared Mahalanobis distance. It is 
seen that the bisquare weight function descends slowly when p = 10, and is almost 
constant when p = 100, while the Rocke function assigns high weights to the bulk 
of the data and rapidly downweights data far from it, except possibly for very small 
values of p. 

Rocke’s biflat family of weight functions is the squared values of the W in (6.42). 
It is smoother at the endpoints but gives less weight to inner points. 


Example 6.3 The following data come from a study on shape recognition that used an
ensemble of shape-feature extractors for the 2D silhouettes of different vehicles. The
purpose is to classify a given silhouette as one of four types of vehicle, using a set of
18 features extracted from the silhouette. The tables and figures for this example are
obtained with script vehicle.R.


The vehicle may be viewed from one of many different angles. The features are
extracted from the silhouettes by an image processing system, which extracts a
combination of scale-independent features utilizing both measures based on classical
moments, such as scaled variance, skewness and kurtosis about the major/minor axes,
and heuristic measures, such as hollows, circularity, rectangularity and compactness.
The data were collected at the Turing Institute, Glasgow, and are available at
https://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes). Here we deal with
the “van” type, which has n = 217 cases.

Figure 6.7 Vehicle data. QQ-plots of squared Mahalanobis distances

Figure 6.7 shows the chi-squared QQ-plots of the squared Mahalanobis distances 
from the classical estimator and the bisquare and Rocke S estimators. It is seen that 
the classical estimator finds no outliers, the bisquare S about 12, and the Rocke S 
about 30. The largest 10 distances for bisquare and Rocke correspond to the same 
cases. 

A more detailed analysis of the data shows that for 10 of the 18 variables, the 
values corresponding to the 30 cases pinpointed by the Rocke S are either the 12% 
lower or upper extreme. We can hence consider those cases as lying on a “corner” 
of the dataset, and we may therefore conclude that the Rocke S has revealed more 
structure than the bisquare S. 


6.4.5 τ-estimators


The same approach used in Section 5.4.3 to obtain robust regression estimators with
controllable efficiency can be employed for multivariate estimation. Multivariate
τ-estimators were proposed and studied by Lopuhaa (1991).
This approach requires two functions, $\rho_1$ and $\rho_2$. For given $(\mu, \Sigma)$ call $\sigma_0(\mu, \Sigma)$
the solution of

$$\frac{1}{n}\sum_{i=1}^{n} \rho_1\left(\frac{d(x_i, \mu, \Sigma)}{\sigma_0}\right) = \delta.$$

Then the estimator $(\hat\mu, \hat\Sigma)$ is defined as the minimizer of the “τ-scale”

$$\tau(\mu, \Sigma) = \sigma_0(\mu, \Sigma)\,\frac{1}{n}\sum_{i=1}^{n} \rho_2\left(\frac{d(x_i, \mu, \Sigma)}{\sigma_0(\mu, \Sigma)}\right)$$

with $|\Sigma| = 1$. Here

$$\rho_2(t) = \rho_1\left(\frac{t}{c}\right), \tag{6.43}$$

where c is chosen to regulate the efficiency.

Originally, τ-estimators were proposed to obtain estimators with higher efficiency
than S-estimators for small p, which required c > 1; but for large p we need c < 1 in
order to decrease the efficiency.

It can be shown that $(\hat\mu, \hat\Sigma)$ satisfies estimating equations of the form (6.31)-(6.32)
with $\hat\sigma = \sigma_0(\hat\mu, \hat\Sigma)$, where W depends on $\rho_1'$ and $\rho_2'$.


6.4.6 One-step reweighting 


In order to increase the efficiency of a given estimator $(\hat\mu, \hat\Sigma)$, a one-step reweighting
procedure can be used in a similar way as shown in Section 5.9.1. Let W be a
weight function. Given the estimator $(\hat\mu, \hat\Sigma)$, define new estimators $\tilde\mu, \tilde\Sigma$ as a weighted
mean vector and weighted covariance matrix with weights $W(d_i)$, where the $d_i$ are the
squared distances corresponding to $\hat\mu$ and $\hat\Sigma$. The most popular function is hard rejection,
corresponding to $W(t) = I(t \le k)$, where k is chosen with the same criterion as
in Section 5.9.1. For $\hat{c}$ defined in (6.22), the distribution of $d_i/\hat{c}$ is approximately $\chi^2_p$
under normality, and hence choosing $k = \hat{c}\,\chi^2_{p,\beta}$ will reject approximately a fraction
$1 - \beta$ of the “good” data if there are no outliers. It is customary to take $\beta = 0.95$ or
0.975. If the scatter matrix estimator is singular, we proceed as in Section 6.3.1.

Although no theoretical results are known, simulations have shown that this procedure
improves the bias and efficiency of the MVE and MCD estimators. But it cannot
be asserted that such improvement happens with any estimator.

Croux and Haesbroeck (1999, Tables VII and VII) computed the finite-sample 
efficiencies of the reweighted MCD; although they are much higher than for the “raw” 
estimator, they are still low if one wants a high BP. 


6.5 MM-estimators 


We now present a family of estimators with controllable efficiency based on the same
principle as the regression MM-estimators described in Section 5.5, namely: start
with a very robust but possibly inefficient estimator, and use it to compute a scale of
the Mahalanobis distances and as a starting point to compute an M-estimator whose
$\rho$-function has a tuning parameter. This was proposed by Lopuhaa (1992) and Tatsuoka
and Tyler (2000) as a means of increasing the low efficiency of S estimators
for small p. Here we give a simplified version of the latter approach.

Let $(\hat\mu_0, \hat\Sigma_0)$ be an initial estimator. Put $d_i^0 = d(x_i, \hat\mu_0, \hat\Sigma_0)$ and call S the respective
M-scale

$$\frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{d_i^0}{S}\right) = \delta. \tag{6.44}$$

The estimator is defined by $(\hat\mu, \hat\Sigma)$ with $|\hat\Sigma| = 1$ such that

$$\sum_{i=1}^{n} \rho\left(\frac{d_i}{cS}\right) = \min, \tag{6.45}$$

where $d_i = d(x_i, \hat\mu, \hat\Sigma)$ and the constant c is chosen to control efficiency.
It can be shown that the solution satisfies the equations

$$\frac{1}{n}\sum_{i=1}^{n} W\left(\frac{d_i}{cS}\right)(x_i - \hat\mu)(x_i - \hat\mu)' = \hat\Sigma, \tag{6.46}$$

$$\frac{1}{n}\sum_{i=1}^{n} W\left(\frac{d_i}{cS}\right)(x_i - \hat\mu) = 0$$

with $W = \rho'$.

In fact, it is not necessary to obtain the absolute minimum in (6.45). As with the
regression MM-estimators in Section 5.5, it is possible to show that any solution of
(6.46) for which the objective function (6.45) is lower than for the initial estimator has the
same asymptotic behavior as the absolute minimum and has BP at least as high as the
initial estimator.

As to the $\rho$-function, there are better choices than the traditional bisquare. Muler
and Yohai (2002) employ a different $\rho$ for time-series estimation, with a weight function
that is constant up to a certain value, and then descends rapidly but smoothly to
zero. We shall call this smoothed hard rejection (SHR). Its version for multivariate
estimation has a weight function

$$W_{\mathrm{SHR}}(d) = \begin{cases} 1 & \text{if } d \le 4 \\ q(d) & \text{if } 4 < d \le 9 \\ 0 & \text{if } d > 9 \end{cases}, \tag{6.47}$$

where

$$q(d) = -1.944 + 1.728d - 0.312d^2 + 0.016d^3.$$


This is such that W is continuous and differentiable at d = 4 and d = 9. The respective
$\rho$ function is

$$\rho_{\mathrm{SHR}}(d) = \frac{1}{6.494}\begin{cases} d & \text{if } d \le 4 \\ s(d) & \text{if } 4 < d \le 9 \\ 6.494 & \text{if } d > 9 \end{cases}, \tag{6.48}$$

where

$$s(d) = 3.534 - 1.944d + 0.864d^2 - 0.104d^3 + 0.004d^4.$$


Figure 6.8 Bisquare and SHR weight functions (horizontal axis: distances; vertical axis: weights)


Figure 6.8 shows the bisquare and SHR weight functions, the former scaled for 
consistency and the latter for 90% efficiency for the normal, with p = 30. It is seen 
that SHR yields a smaller cutoff point, while giving more weight to small distances. 
The advantages of SHR will be demonstrated in Section 6.10. 
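The SHR weight function (6.47) can be checked numerically with a short sketch. The coefficients are those given for q(d) in the text; the function name `shr_weight` is ours.

```python
def shr_weight(d):
    """Smoothed hard rejection weight (6.47) for a squared distance d >= 0."""
    if d <= 4.0:
        return 1.0
    if d > 9.0:
        return 0.0
    # Cubic q(d) joining the constant pieces between d = 4 and d = 9.
    return -1.944 + 1.728 * d - 0.312 * d ** 2 + 0.016 * d ** 3


if __name__ == "__main__":
    for d in [0.0, 4.0, 5.0, 6.5, 8.0, 9.0, 12.0]:
        print(f"d={d:5.1f}  W={shr_weight(d):.4f}")
```

Evaluating the cubic at the two joins gives values very close to 1 and 0, in agreement with the continuity claimed in the text.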


6.6 The Stahel—Donoho estimator 


Recall that the simplest approach to the detection of outliers in a univariate sample is
the one given in Section 1.3: for each data point compute an “outlyingness measure”
(1.4) and identify those points having a “large” value of this measure. The key idea
for the extension of this approach to the multivariate case is that a multivariate outlier
should be an outlier in some univariate projection. More precisely, given a direction
$a \in R^p$ with $\|a\| = 1$, denote by $a'X = \{a'x_1, \ldots, a'x_n\}$ the projection of the dataset
X along a. Let $\hat\mu$ and $\hat\sigma$ be robust univariate location and scatter statistics, say the
median and MAD respectively. The outlyingness with respect to X of a point $x \in R^p$
along a is defined, as in (1.4), by

$$t(x, a) = \frac{x'a - \hat\mu(a'X)}{\hat\sigma(a'X)}.$$

The outlyingness of x is then defined by

$$t(x) = \max_{a} |t(x, a)|. \tag{6.49}$$

In the above maximum, a ranges over the set $\{\|a\| = 1\}$, but in view of the equivariance
of $\hat\mu$ and $\hat\sigma$, it is equivalent to take the set $\{a \ne 0\}$.

The Stahel–Donoho estimator, proposed by Stahel (1981) and Donoho (1982), is
a weighted mean and covariance matrix, where the weight of $x_i$ is a nonincreasing
function of $t(x_i)$. More precisely, let $W_1$ and $W_2$ be two weight functions, and define

$$\hat\mu = \frac{1}{\sum_{i=1}^{n} w_{i1}} \sum_{i=1}^{n} w_{i1} x_i, \tag{6.50}$$

$$\hat\Sigma = \frac{1}{\sum_{i=1}^{n} w_{i2}} \sum_{i=1}^{n} w_{i2}(x_i - \hat\mu)(x_i - \hat\mu)' \tag{6.51}$$

with

$$w_{ij} = W_j(t(x_i)), \quad j = 1, 2. \tag{6.52}$$

If $y_i = Ax_i + b$, then it is easy to show that $t(y_i) = t(x_i)$ (t is invariant) and hence the
estimators are equivariant.

In order that no term dominates in (6.50)-(6.51) it is clear that the weight functions
must satisfy the conditions

$$tW_1(t) \text{ and } t^2W_2(t) \text{ are bounded for } t \ge 0. \tag{6.53}$$

It can be shown (see Maronna and Yohai, 1995) that under (6.53) the asymptotic BP is
1/2. For the FBP, Tyler (1994) and Gather and Hilker (1997) show that the estimator
attains the maximum BP given by (6.27) if $\hat\mu$ is the sample median and the scale is

$$\hat\sigma(z) = \frac{1}{2}\left(z_{(k)} + z_{(k+1)}\right)$$

where $z_{(i)}$ denotes the ordered values of $|z_i - \mathrm{Med}(z)|$ and $k = [(n + p)/2]$.

The asymptotic normality of the estimator was shown by Zuo et al. (2004a), 
and its influence function and maximum asymptotic bias were derived by Zuo et al. 
(2004b). 
Note that since the $W_j$ in (6.52) may include tuning constants, these estimators
have a controllable efficiency. There are several proposals for the form of these functions.
A family of weight functions used in the literature is the “Huber weights”

$$W_{c,k}(t) = \min\left(1, \left(\frac{c}{t}\right)^k\right) \tag{6.54}$$

where $k \ge 2$ in order to satisfy (6.53). Maronna and Yohai (1995) used


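$$W_1 = W_2 = W_{c,2} \quad\text{with } c = \sqrt{\chi^2_{p,\beta}} \tag{6.55}$$

in (6.52) with $\beta = 0.95$. Zuo et al. (2004a) proposed the family of weights

$$W_{c,k}(t) = \min\left\{1,\ 1 - \frac{1}{b}\exp\left[-k\left(1 - \frac{1}{c(1 + t)}\right)^2\right]\right\} \tag{6.56}$$

where $c = \mathrm{Med}(1/(1 + t(x)))$, k is a tuning parameter and $b = 1 - e^{-k}$.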

Another choice is the SHR family, defined in (6.47), the advantages of which are 
demonstrated in Section 6.10. 

Simulations suggest that one-step reweighting does not improve the 
Stahel—Donoho estimator. 


6.7 Asymptotic bias 


This section deals with the definition of asymptotic bias for multivariate location
and scatter estimators. We now deal with data from a contaminated distribution
$F = (1 - \varepsilon)F_0 + \varepsilon G$, where $F_0$ describes the “typical” data. In order to define bias,
we have to define which are the “true” parameters to be estimated. For concreteness,
assume $F_0 = \mathrm{N}_p(\mu_0, \Sigma_0)$, but note that the following discussion applies to any other
elliptical distribution. Let $\hat\mu_\infty$ and $\hat\Sigma_\infty$ be the asymptotic values of location and
scatter estimators.

Defining a single measure of bias for a multidimensional estimator is more complicated
than in Section 3.3. Assume first that $\Sigma_0 = I$. In this case, the symmetry of
the situation makes it natural to choose the Euclidean norm $\|\hat\mu_\infty - \mu_0\|$ as a reasonable
bias measure for location. For the scatter matrix, size is relatively easy to adjust,
by means of (6.21) or (6.22), and it will be most useful to focus on shape. Thus we
want to measure the discrepancy between $\hat\Sigma_\infty$ and scalar multiples of I. The simplest
way to do so is with the condition number, which is defined as the ratio of the largest
to the smallest eigenvalue,

$$\mathrm{cond}(\hat\Sigma_\infty) = \frac{\lambda_1(\hat\Sigma_\infty)}{\lambda_p(\hat\Sigma_\infty)}.$$

The condition number equals 1 if and only if $\hat\Sigma_\infty = cI$ for some $c \in R$. Other functions
of the eigenvalues may be used for measuring shape discrepancies, such as the
likelihood ratio test statistic for testing sphericity (Seber, 1984), which is the ratio of
the arithmetic to the geometric mean of the eigenvalues:

$$\frac{\mathrm{trace}(\hat\Sigma_\infty)}{p\,|\hat\Sigma_\infty|^{1/p}}.$$

It is easy to show that in the special case of a spherical distribution, the asymptotic
value of an equivariant $\hat\Sigma$ is a scalar multiple of I (Problem 6.3), and so in this case
there is no shape discrepancy.

For the case of an equivariant estimator and a general $\Sigma_0$, we want to define bias
so that it is invariant under affine transformations; that is, the bias does not change
if x is replaced by Ax + b. To this end we “normalize” the data so that it has identity
scatter matrix. Let A be any matrix such that $A'A = \Sigma_0^{-1}$, and define y = Ax.
Then y has mean $A\mu_0$ and identity scatter matrix, and if the estimators $\hat\mu$ and $\hat\Sigma$ are
equivariant then their asymptotic values based on data $y_i$ are $A\hat\mu_\infty$ and $A\hat\Sigma_\infty A'$
respectively, where $\hat\mu_\infty$ and $\hat\Sigma_\infty$ are their values based on data $x_i$. Since their respective
discrepancies are given by

$$\|A\hat\mu_\infty - A\mu_0\|^2 = (\hat\mu_\infty - \mu_0)'\Sigma_0^{-1}(\hat\mu_\infty - \mu_0) \quad\text{and}\quad \mathrm{cond}(A\hat\Sigma_\infty A')$$

and noting that $A\hat\Sigma_\infty A'$ has the same eigenvalues as $\Sigma_0^{-1}\hat\Sigma_\infty$, it is natural to define

$$\mathrm{bias}(\hat\mu) = \sqrt{(\hat\mu_\infty - \mu_0)'\Sigma_0^{-1}(\hat\mu_\infty - \mu_0)} \quad\text{and}\quad \mathrm{bias}(\hat\Sigma) = \mathrm{cond}(\Sigma_0^{-1}\hat\Sigma_\infty). \tag{6.57}$$

It is easy to show that if the estimators are equivariant, then (6.57) does not depend
upon either $\mu_0$ or $\Sigma_0$. Therefore, to evaluate equivariant estimators, we may, without
loss of generality, take $\mu_0 = 0$ and $\Sigma_0 = I$.
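The two bias measures in (6.57) and their invariance under affine transformations can be illustrated for p = 2, where the eigenvalues of $\Sigma_0^{-1}\hat\Sigma_\infty$ follow from its trace and determinant. The sketch below is restricted to 2 × 2 matrices for brevity; the helper names and all numerical values are made up for the illustration.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]


def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def bias_scatter(sigma0, sigma_inf):
    """cond(Sigma_0^{-1} Sigma_inf) as in (6.57), for 2x2 positive definite inputs."""
    m = matmul(inverse(sigma0), sigma_inf)
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (tr + disc) / (tr - disc)  # ratio of largest to smallest eigenvalue


def bias_location(mu0, sigma0, mu_inf):
    """sqrt((mu_inf - mu0)' Sigma_0^{-1} (mu_inf - mu0)) as in (6.57)."""
    inv = inverse(sigma0)
    u = [mu_inf[0] - mu0[0], mu_inf[1] - mu0[1]]
    return math.sqrt(sum(u[i] * inv[i][j] * u[j] for i in range(2) for j in range(2)))


if __name__ == "__main__":
    sigma0 = [[2.0, 0.5], [0.5, 1.0]]      # made-up "true" scatter
    sigma_inf = [[3.0, 1.0], [1.0, 2.0]]   # made-up asymptotic value
    print(bias_scatter(sigma0, sigma_inf))
    print(bias_location([1.0, 2.0], sigma0, [1.5, 1.0]))
```

Replacing $\Sigma_0$ and $\hat\Sigma_\infty$ by $A\Sigma_0A'$ and $A\hat\Sigma_\infty A'$ (and the location vectors by their images under x → Ax + b) leaves both values unchanged up to rounding, which is the invariance stated above.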

Adrover and Yohai (2002) computed the maximum asymptotic biases of several 
robust estimators. However, in order to compare the performances of different esti- 
mators we shall, in Section 6.10, employ a measure that takes both bias and variability 
into account. 


6.8 Numerical computing of multivariate estimators 


6.8.1 Monotone M-estimators 


Equations (6.15) and (6.16) yield an iterative algorithm similar to the one for regression
in Section 4.5. Start with initial estimators $\hat\mu_0$ and $\hat\Sigma_0$, for example the vector of
coordinate-wise medians and the diagonal matrix with the squared normalized MADs
of the variables in the diagonal. At iteration k, let $d_{ki} = d(x_i, \hat\mu_k, \hat\Sigma_k)$ and compute

$$\hat\mu_{k+1} = \frac{\sum_{i=1}^{n} W_1(d_{ki})\,x_i}{\sum_{i=1}^{n} W_1(d_{ki})}, \qquad \hat\Sigma_{k+1} = \frac{1}{n}\sum_{i=1}^{n} W_2(d_{ki})(x_i - \hat\mu_{k+1})(x_i - \hat\mu_{k+1})'. \tag{6.58}$$

If, at some iteration, $\hat\Sigma_k$ becomes singular, it suffices to compute the $d_i$ through (6.20).
The convergence of the procedure is established in Section 9.5. Since the solution is 
unique for monotone M-estimators, the starting values influence only the number of 
iterations but not the end result. 


6.8.2 Local solutions for S-estimators 


Since local minima of $\hat\sigma$ are solutions of the M-estimating equations (6.31)-(6.32), a
natural procedure to minimize $\hat\sigma$ is to use the iterative procedure (6.58) to solve the
equations, with $W_1 = W_2$ equal to $W = \rho'$, as stated in (6.33). It must be recalled that
since $tW(t)$ is redescending, this pair of equations yields only a local minimum of $\hat\sigma$,
and hence the starting values chosen are essential. Assume for the moment that we
have the initial $\hat\mu_0$ and $\hat\Sigma_0$ (their computation is considered in Section 6.8.5).

At iteration k, call $\hat\mu_k$ and $\hat\Sigma_k$ the current values and compute

$$d_{ki} = d(x_i, \hat\mu_k, \hat\Sigma_k), \qquad \hat\sigma_k = \hat\sigma(d_{k1}, \ldots, d_{kn}), \qquad w_{ki} = W\left(\frac{d_{ki}}{\hat\sigma_k}\right). \tag{6.59}$$

Then compute

$$\hat\mu_{k+1} = \frac{\sum_{i=1}^{n} w_{ki}x_i}{\sum_{i=1}^{n} w_{ki}}, \qquad C_k = \sum_{i=1}^{n} w_{ki}(x_i - \hat\mu_{k+1})(x_i - \hat\mu_{k+1})', \qquad \hat\Sigma_{k+1} = \frac{C_k}{|C_k|^{1/p}}. \tag{6.60}$$

It is shown in Section 9.6 that if the weight function W is nonincreasing, then $\hat\sigma_k$
decreases at each step. One can then stop the iteration when the relative change
$(\hat\sigma_k - \hat\sigma_{k+1})/\hat\sigma_k$ is below a given tolerance. Experience shows that since the decrease
of $\hat\sigma_k$ is generally slow, it is not necessary to recompute it at each step, but at, say,
every tenth iteration.

If W is not monotonic, the iteration steps (6.59)-(6.60) are not guaranteed to
cause a decrease in $\hat\sigma_k$ at each step. However, the algorithm can be modified to
ensure a decrease at each iteration. Since the details are involved, they are deferred
to Section 9.6.1.


6.8.3 Subsampling for estimators based on a robust scale 


The obvious procedure to generate an initial approximation for an estimator defined
by (6.28) is to follow the general approach for regression described in Section 5.7.2, in
which the minimization problem is replaced by a finite one, whose candidate
estimators are sample means and covariance matrices of subsamples. To obtain a
finite set of candidate solutions, take a subsample of size p + 1, $\{x_i : i \in J\}$, where
the set $J \subset \{1, \ldots, n\}$ has p + 1 elements, and compute

$$\hat\mu_J = \mathrm{ave}_{i \in J}(x_i) \quad\text{and}\quad \hat\Sigma_J = \frac{C_J}{|C_J|^{1/p}}, \tag{6.61}$$

where $C_J$ is the covariance matrix of the subsample; and let

$$\mathbf{d}_J = \mathbf{d}(X, \hat\mu_J, \hat\Sigma_J). \tag{6.62}$$

The problem of minimizing $\hat\sigma$ is thus replaced by the finite problem of minimizing
$\hat\sigma(\mathbf{d}_J)$ over J. Since choosing all $\binom{n}{p+1}$ subsamples is prohibitive unless both n and
p are rather small, we choose N of them at random, $\{J_k : k = 1, \ldots, N\}$, and the
estimators are $\hat\mu_{J_{k^*}}, \hat\Sigma_{J_{k^*}}$ with

$$k^* = \arg\min_{k = 1, \ldots, N} \hat\sigma(\mathbf{d}_{J_k}). \tag{6.63}$$


If the sample contains a proportion $\varepsilon$ of outliers, the probability of at least one
“good” subsample is $1 - (1 - \alpha)^N$, where $\alpha = (1 - \varepsilon)^{p+1}$. If we want this probability
to be larger than $1 - \delta$ we must have

$$N \ge \frac{|\ln \delta|}{|\ln(1 - \alpha)|} \approx \frac{|\ln \delta|}{(1 - \varepsilon)^{p+1}}. \tag{6.64}$$

See Table 5.3 for the values of N required as a function of p and $\varepsilon$.
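The bound (6.64) is easy to evaluate. The sketch below returns the smallest integer N satisfying the exact inequality, together with the approximation on the right-hand side of (6.64); the function names and the default $\delta = 0.01$ are illustrative assumptions and need not match the settings used for Table 5.3.

```python
import math


def n_subsamples(p, eps, delta=0.01):
    """Smallest N with 1 - (1 - alpha)**N >= 1 - delta, alpha = (1 - eps)**(p + 1)."""
    alpha = (1.0 - eps) ** (p + 1)
    if alpha >= 1.0:
        return 1  # no contamination: every subsample is "good"
    # log1p keeps accuracy when alpha is tiny (large p or large eps).
    return math.ceil(math.log(delta) / math.log1p(-alpha))


def n_subsamples_approx(p, eps, delta=0.01):
    """Right-hand approximation in (6.64): |ln delta| / (1 - eps)**(p + 1)."""
    return abs(math.log(delta)) / (1.0 - eps) ** (p + 1)


if __name__ == "__main__":
    for p in [2, 5, 10, 20]:
        for eps in [0.1, 0.25, 0.5]:
            print(p, eps, n_subsamples(p, eps), round(n_subsamples_approx(p, eps), 1))
```

The output makes the exponential growth of N in p visible, which is why pure subsampling becomes impractical in high dimension.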
A simple but effective improvement of the subsampling procedure is as follows.
For subsample J with distances $\mathbf{d}_J$ defined in (6.62), let $t = \mathrm{Med}(\mathbf{d}_J)$ and compute

$$\mu^*_J = \mathrm{ave}\{x_i : d_{Ji} \le t\}, \qquad C^*_J = \mathrm{Var}\{x_i : d_{Ji} \le t\}, \qquad \Sigma^*_J = \frac{C^*_J}{|C^*_J|^{1/p}}. \tag{6.65}$$

Then use $\mu^*_J$ and $\Sigma^*_J$ instead of $\hat\mu_J$ and $\hat\Sigma_J$. The motivation for this idea is that a subsample
of p + 1 points is too small to yield reliable means and covariances, and so it
is desirable to enlarge the subsample in a suitable way. This is done by selecting the
half-sample with the smallest distances. Although no theoretical results are known for
this method, our simulations show that this extra effort yields a remarkable improvement
in the behavior of the estimators, with only a small increase in computational
time. In particular, for the MVE estimator, the minimum scale obtained this way is
always much smaller than that obtained with the original subsamples. For example,
the ratio is about 0.3 for p = 40 and 500 subsamples.
6.8.4 The MVE


We have seen that the objective function of S-estimators can be decreased by iteration 
steps, and the same thing happens with the MCD (Section 6.8.6). However, no such 
improvements are known for the MVE, which makes the outcome of the subsampling 
procedure the only available approximation to the estimator. 

The simplest approach to this problem is to use directly the “best” subsample 
given by (6.63). However, in view of the success of the improved subsampling method 
given by (6.65), we make it our method of choice for computing the MVE. 

An exact method for the MVE was proposed by Agulló (1996) but, since it is not
feasible except for small n and p, we do not describe it here.


6.8.5 Computation of S-estimators 


Once we have initial values $\hat\mu_0$ and $\hat\Sigma_0$, an S-estimator is computed by means of
the iterative procedures described in Section 6.8.2. We present two approaches to
compute $\hat\mu_0$ and $\hat\Sigma_0$.

The simplest approach is to obtain initial values of $(\hat\mu_0, \hat\Sigma_0)$ through subsampling,
and then apply the iterative algorithm. Much better results are obtained by following
the same principles as the strategy described for regression in Section 5.7.3.
This approach was first employed for multivariate estimation by Rousseeuw and van 
Driessen (1999) (see Section 6.8.6). 

Our preferred approach, however, proceeds as was done in Section 5.5 for com- 
puting the MM-estimators of regression; that is, start the iterative algorithm from a 
bias-robust but possibly inefficient estimator. The MVE appears to be a candidate for 
this task. Adrover and Yohai (2002) computed the maximum asymptotic biases of the 
MVE, MCD, Stahel—Donoho and Rocke estimators, and concluded that the MVE has 
the smallest maximum bias for p > 10. 

It is important to note that although the MVE estimator has the unattractive feature
of a slow $n^{-1/3}$ rate of consistency, this feature does not affect the efficiency of the
local minimum that is the outcome of the iterative algorithm, since it satisfies the
M-estimating equations (6.31)-(6.32); if equations (6.17)-(6.18) for the $\hat\mu_\infty$, $\hat\Sigma_\infty$ have
a unique solution, then all solutions of (6.31)-(6.32) converge to $(\hat\mu_\infty, \hat\Sigma_\infty)$ with a rate
of order $n^{-1/2}$.

The MVE computed using the improved method (6.65) is an option for the initial 
estimator. However, it will be seen in Section 6.10 that a more complex procedure, 
to be described in Section 6.9.2, yields better results. 

Other numerical algorithms have been proposed by Ruppert (1992) and
Woodruff and Rocke (1994).


6.8.6 The MCD 


Rousseeuw and van Driessen (1999) found an iterative algorithm for the MCD,
based on the following fact. Given any $\hat\mu_1$ and $\hat\Sigma_1$, let $d_i$ be the corresponding squared
distances. Then compute $\hat\mu_2$ and C as the sample mean and covariance matrix of the
data with the h smallest of the $d_i$'s, and set $\hat\Sigma_2 = C/|C|^{1/p}$. Then $\hat\mu_2$ and $\hat\Sigma_2$ yield
a lower value of $\hat\sigma$ in (6.39) than $\hat\mu_1$ and $\hat\Sigma_1$. This is called the concentration step
(“C-step” in the above paper), and a proof of the above reduction in $\hat\sigma$ is given in
Section 9.6.2. In this case, the modification (6.65) is not necessary, since the concentration
steps already perform this sort of modification. The overall strategy then
is as follows: for each of N candidate solutions obtained by subsampling, perform,
say, two of the above steps, keep the 10 out of N that yield the smallest values of the
criterion, and starting from each of them iterate the C-steps to convergence.
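The concentration step can be sketched in a few lines for bivariate data, where the 2 × 2 inverse and determinant have closed forms. This is a simplified illustration under our own assumptions (p = 2, symmetric matrices stored as triples, made-up data, a single starting value instead of N subsamples); it is not the full algorithm of Rousseeuw and van Driessen.

```python
def mean_cov(points):
    """Sample mean and covariance of 2-D points (the divisor cancels after normalization)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def unit_det(c):
    """Rescale the symmetric matrix [[a, b], [b, d]] = (a, b, d) to determinant 1."""
    a, b, d = c
    root = (a * d - b * b) ** 0.5  # |C|^(1/p) with p = 2
    return (a / root, b / root, d / root)


def sq_dist(pt, mu, sigma):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu) in 2-D."""
    a, b, d = sigma
    det = a * d - b * b
    u, v = pt[0] - mu[0], pt[1] - mu[1]
    return (d * u * u - 2.0 * b * u * v + a * v * v) / det


def trimmed_scale(points, mu, sigma, h):
    """Sum of the h smallest squared distances, as in (6.39)."""
    return sum(sorted(sq_dist(pt, mu, sigma) for pt in points)[:h])


def c_step(points, mu, sigma, h):
    """One concentration step: refit on the h points closest to (mu, sigma)."""
    closest = sorted(points, key=lambda pt: sq_dist(pt, mu, sigma))[:h]
    mu2, c = mean_cov(closest)
    return mu2, unit_det(c)


if __name__ == "__main__":
    pts = [(0.1, 0.2), (0.9, 1.3), (-0.5, -0.2), (1.2, 0.8), (-1.0, -1.4),
           (0.4, -0.3), (-0.7, 0.6), (0.3, 1.0), (8.0, -7.5), (9.0, -8.0)]
    h = 6
    mu, c = mean_cov(pts)       # non-robust start, for illustration only
    sigma = unit_det(c)
    for step in range(5):
        print(step, round(trimmed_scale(pts, mu, sigma, h), 4))
        mu, sigma = c_step(pts, mu, sigma, h)
```

The printed values of the trimmed scale should not increase from one step to the next, in line with the reduction property stated above.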


6.8.7 The Stahel–Donoho estimator


No exact algorithm for the Stahel–Donoho estimator is known. To approximate the
estimator we need a large number of directions, and these can be obtained by subsampling.
For each subsample $J = \{x_{i_1}, \ldots, x_{i_p}\}$ of size p, let $a_J$ be a vector of norm
1 orthogonal to the hyperplane spanned by the subsample. The unit length vector $a_J$
can be obtained by applying the QR orthogonalization procedure (see, for example,
Chambers (1977)) to $\{x_{i_1} - \bar{x}_J, \ldots, x_{i_{p-1}} - \bar{x}_J, b\}$, where $\bar{x}_J$ is the average of the subsample
and b is any vector not collinear with $\bar{x}_J$. Then we generate N subsamples
$J_1, \ldots, J_N$ and replace (6.49) by


$$\hat{t}(x) = \max_{k} |t(x, a_{J_k})|.$$

It is easy to show that $\hat{t}$ is invariant under affine transformations, and hence the
approximate estimator is equivariant.
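For bivariate data the direction orthogonal to the "hyperplane" through a subsample of size p = 2 is simply the normal to the line through two points, so the approximation can be sketched without a QR routine. The code below makes several simplifying assumptions of our own: p = 2, all pairs of points are used instead of N random subsamples, the MAD is normalized by 0.6745, and the weights are the Huber weights (6.54) with k = 2 and $c = \sqrt{\chi^2_{2,0.95}}$, which for two degrees of freedom equals $\sqrt{-2\ln 0.05}$. The data are made up.

```python
import math
from itertools import combinations
from statistics import median


def mad(values):
    """Normalized median absolute deviation about the median."""
    m = median(values)
    return median([abs(v - m) for v in values]) / 0.6745


def directions(points):
    """Unit normals to the lines through every pair of distinct 2-D points."""
    dirs = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        dx, dy = x2 - x1, y2 - y1
        norm = math.hypot(dx, dy)
        if norm > 0.0:
            dirs.append((-dy / norm, dx / norm))
    return dirs


def outlyingness(points):
    """Approximate t(x_i) of (6.49), maximizing over the pairwise directions."""
    t = [0.0] * len(points)
    for ax, ay in directions(points):
        proj = [ax * x + ay * y for x, y in points]
        loc, scale = median(proj), mad(proj)
        if scale == 0.0:
            continue  # degenerate projection: at least half the values coincide
        for i, z in enumerate(proj):
            t[i] = max(t[i], abs(z - loc) / scale)
    return t


def huber_weight(t, c, k=2):
    """Huber weight (6.54): min(1, (c / t)**k)."""
    return 1.0 if t <= c else (c / t) ** k


def sd_location(points, c):
    """Weighted mean (6.50) with weights W(t(x_i))."""
    w = [huber_weight(ti, c) for ti in outlyingness(points)]
    sw = sum(w)
    return (sum(wi * x for wi, (x, _) in zip(w, points)) / sw,
            sum(wi * y for wi, (_, y) in zip(w, points)) / sw)


if __name__ == "__main__":
    pts = [(0.1, 0.2), (0.9, 1.3), (-0.5, -0.2), (1.2, 0.8), (-1.0, -1.4),
           (0.4, -0.3), (-0.7, 0.6), (0.3, 1.0), (8.0, -7.5)]
    c = math.sqrt(-2.0 * math.log(0.05))  # sqrt of the 0.95 chi-squared quantile, 2 df
    print([round(v, 2) for v in outlyingness(pts)])
    print(sd_location(pts, c))
```

Applying an affine transformation to the data leaves the computed outlyingness values unchanged up to rounding, because the set of pairwise normals transforms along with the data.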


6.9 Faster robust scatter matrix estimators 


Estimators based on a subsampling approach will be too slow when p is large; for 
example, of the order of a few hundred. We now present two faster methods for 
high-dimensional data. These are based on projections. The first is deterministic and 
is based on pairwise robust covariances. The second combines deterministic direc- 
tions that yield extreme values of the kurtosis with random ones obtained by an 
elaborate procedure. 


6.9.1 Using pairwise robust covariances 


Much faster estimators can be obtained if equivariance is given up. The simplest 
approaches for location and dispersion are respectively to apply a robust location 
estimator to each coordinate and a robust estimator of covariance to each pair of 
variables. Such pairwise robust covariance estimators are easy to compute, but 
unfortunately the resulting scatter matrix lacks affine equivariance and positive 
definiteness. Besides, such estimators for location and scatter may lack both bias
robustness and high normal efficiency if the data are highly correlated. This is because
the coordinate-wise location estimators need to incorporate the correlation structure 
for full efficiency for the normal distribution, and because the pairwise covariance 
estimators may fail to downweight higher-dimensional outliers. 

A simple way to define a robust covariance between two random variables x, y is
by truncation or rejection. Let $\psi$ be a bounded monotone or redescending $\psi$-function,
and $\mu(.)$ and $\sigma(.)$ robust location and scatter statistics. Then robust correlations and
covariances can be defined as

$$\mathrm{RCov}(x, y) = \sigma(x)\sigma(y)\,\mathrm{E}\left[\psi\left(\frac{x - \mu(x)}{\sigma(x)}\right)\psi\left(\frac{y - \mu(y)}{\sigma(y)}\right)\right], \tag{6.66}$$

$$\mathrm{RCorr}(x, y) = \frac{\mathrm{RCov}(x, y)}{[\mathrm{RCov}(x, x)\,\mathrm{RCov}(y, y)]^{1/2}}. \tag{6.67}$$

See Huber and Ronchetti (2009; Sec 8.2-8.3). This definition satisfies
RCorr(x, x) = 1. When $\psi(x) = \mathrm{sgn}(x)$ and $\mu$ is the median, (6.67) and (6.70)
are called the quadrant correlation and covariance estimators. The sample versions
of (6.66) and (6.67) are obtained by replacing the expectation by the average, and $\mu$
and $\sigma$ by their estimators $\hat\mu$ and $\hat\sigma$.

These estimators are not consistent under a given model. In particular, if
$\mathcal{D}(x, y)$ is bivariate normal with correlation $\rho$ and $\psi$ is monotone, then the value
$\rho_R$ of RCorr(x, y) is an increasing function $\rho_R = g(\rho)$ of $\rho$, which can be computed
(Problem 6.11). Then, the estimator $\hat\rho_R$ of $\rho_R$ can be corrected to ensure consistency
for the normal model by using the inverse transformation $\hat\rho = g^{-1}(\hat\rho_R)$.
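As an illustration of the correction, consider the quadrant correlation. For a bivariate normal pair the classical identity $\mathrm{E}[\mathrm{sgn}(x)\,\mathrm{sgn}(y)] = (2/\pi)\arcsin\rho$ gives $g(\rho) = (2/\pi)\arcsin\rho$, so that $g^{-1}(r) = \sin(\pi r/2)$. The Python sketch below uses hypothetical function names and assumes centered normal data.

```python
import numpy as np

def quadrant_corr(x, y):
    """Sample version of (6.67) with psi = sgn and mu = median."""
    sx = np.sign(x - np.median(x))
    sy = np.sign(y - np.median(y))
    return np.mean(sx * sy) / np.sqrt(np.mean(sx ** 2) * np.mean(sy ** 2))

def corrected_quadrant_corr(r):
    """Inverse transformation g^{-1}(r) = sin(pi r / 2) for the normal model."""
    return np.sin(np.pi * r / 2.0)

rng = np.random.default_rng(0)
n, rho = 20000, 0.6
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
r_raw = quadrant_corr(x, y)                 # estimates g(rho), not rho
r_fixed = corrected_quadrant_corr(r_raw)    # consistent for rho under normality
```

The raw value estimates $g(0.6) \approx 0.41$, while the corrected value should be close to 0.6.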

Another robust pairwise covariance, initially proposed by Gnanadesikan and
Kettenring (1972) and later studied by Devlin et al. (1981), is based on the identity

$$\mathrm{Cov}(x, y) = \frac{1}{4}\left(\mathrm{SD}(x + y)^2 - \mathrm{SD}(x - y)^2\right). \tag{6.68}$$


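The proposal defined a robust correlation by replacing the standard deviation by a
robust scatter $\sigma$ (they chose a trimmed standard deviation):

$$\mathrm{RCorr}(x, y) = \frac{1}{4}\left(\sigma\left(\frac{x}{\sigma(x)} + \frac{y}{\sigma(y)}\right)^2 - \sigma\left(\frac{x}{\sigma(x)} - \frac{y}{\sigma(y)}\right)^2\right). \tag{6.69}$$

A robust covariance is defined by

$$\mathrm{RCov}(x, y) = \sigma(x)\sigma(y)\,\mathrm{RCorr}(x, y). \tag{6.70}$$

The latter satisfies

$$\mathrm{RCov}(t_1 x, t_2 y) = t_1 t_2\,\mathrm{RCov}(x, y) \quad \text{for all } t_1, t_2 \in R \tag{6.71}$$

and

$$\mathrm{RCov}(x, x) = \sigma(x)^2.$$

Note that dividing x and y by their $\sigma$s in (6.69) is required for (6.71) to hold.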
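A minimal Python sketch of (6.69) and (6.70) follows. It assumes the normalized MAD as the robust scatter $\sigma$ (the original proposal used a trimmed standard deviation), and the function names are hypothetical.

```python
import numpy as np

def madn(z):
    """Normalized MAD, used here as the robust scatter sigma (an assumption)."""
    return 1.4826 * np.median(np.abs(z - np.median(z)))

def gk_corr(x, y, scale=madn):
    """Robust correlation (6.69): standardize first, then apply the identity."""
    u, v = x / scale(x), y / scale(y)
    return 0.25 * (scale(u + v) ** 2 - scale(u - v) ** 2)

def gk_cov(x, y, scale=madn):
    """Robust covariance (6.70)."""
    return scale(x) * scale(y) * gk_corr(x, y, scale)

rng = np.random.default_rng(0)
x = rng.standard_normal(501)
y = 0.7 * x + rng.standard_normal(501)
```

With these definitions `gk_cov(-2 * x, 3 * y)` equals `-6 * gk_cov(x, y)` up to rounding, which is property (6.71), and `gk_cov(x, x)` equals `madn(x) ** 2`.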
The above pairwise robust covariances can be used in the obvious way to define
a “robust correlation (or covariance) matrix” of a random vector $x = (x_1, \ldots, x_p)'$.
The resulting scatter matrix is symmetric but not necessarily positive semidefinite,
and is not affine equivariant. Genton and Ma (1999) calculated the influence function
and asymptotic efficiency of the estimators of such matrices. It can be shown that the
above correlation matrix estimator is consistent if $\mathcal{D}(x)$ is an elliptical distribution,
and a proof is given in Section 6.17.10.

Maronna and Zamar (2002) show that a simple modification of Gnanadesikan and
Kettenring’s approach yields a positive definite matrix and “approximately
equivariant” estimators of location and scatter. Recall that if $\Sigma$ is the covariance
matrix of the p-dimensional random vector x, and $\sigma$ denotes the standard deviation, then

$$\sigma(a'x)^2 = a'\Sigma a \tag{6.72}$$

for all $a \in R^p$. The lack of positive semidefiniteness of the Gnanadesikan-Kettenring
matrix is overcome by a modification that forces (6.72) for a robust $\sigma$ and a set
of “principal directions”, and is based on the observation that the eigenvalues
of the covariance matrix are the variances along the directions given by the
respective eigenvectors.

Let $X = [x_{ij}]$ be an $n \times p$ data matrix with rows $x_i'$, $i = 1, \ldots, n$, and columns $x^j$,
$j = 1, \ldots, p$. Let $\hat\mu(.)$ and $\hat\sigma(.)$ be robust univariate location and scatter statistics.
For a data matrix X, we shall define a robust scatter matrix estimator $\hat\Sigma(X)$ and a robust
location vector estimator $\hat\mu(X)$ by the following computational steps:

1. First compute a normalized data matrix Y with columns $y^j = x^j/\hat\sigma(x^j)$, and hence
   with rows

   $$y_i = D^{-1}x_i \ (i = 1, \ldots, n) \quad \text{where } D = \mathrm{diag}(\hat\sigma(x^1), \ldots, \hat\sigma(x^p)). \tag{6.73}$$

2. Compute a robust “correlation matrix” $U = [U_{jk}]$ of X as the “covariance matrix”
   of Y by applying (6.69) to the columns of Y:

   $$U_{jj} = 1, \quad U_{jk} = \frac{1}{4}\left[\hat\sigma(y^j + y^k)^2 - \hat\sigma(y^j - y^k)^2\right] \ (j \neq k).$$

3. Compute the eigenvalues $\lambda_j$ and eigenvectors $e_j$ of U ($j = 1, \ldots, p$), and let E
   be the matrix whose columns are the $e_j$. It follows that $U = E\Lambda E'$, where
   $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Here the $\lambda_j$ need not be nonnegative. This is the
   “principal component decomposition” of Y.

4. Compute the matrix Z with rows

   $$z_i = E'y_i = E'D^{-1}x_i \quad (i = 1, \ldots, n) \tag{6.74}$$

   so that $(z^1, \ldots, z^p)$ are the “principal components” of Y.

5. Compute $\hat\sigma(z^j)$ and $\hat\mu(z^j)$ for $j = 1, \ldots, p$, and set

   $$\Gamma = \mathrm{diag}(\hat\sigma(z^1)^2, \ldots, \hat\sigma(z^p)^2), \quad \nu = (\hat\mu(z^1), \ldots, \hat\mu(z^p))'.$$

   Here the elements of $\Gamma$ are nonnegative. Being “principal components” of Y, the
   $z^j$ should be approximately uncorrelated with covariance matrix $\Gamma$.

6. Now transform back to X with

   $$x_i = Az_i, \quad \text{with } A = DE, \tag{6.75}$$

   and finally define

   $$\hat\Sigma(X) = A\Gamma A', \quad \hat\mu(X) = A\nu. \tag{6.76}$$

The justification for the last equation is that, if $\nu$ and $\Gamma$ were the mean and covariance
matrix of Z, since $x_i = Az_i$ the mean and covariance matrix of X would be given
by (6.76).

Note that (6.73) makes the estimator scale equivariant, and that (6.76) replaces
the $\lambda_j$, which may be negative, with the “robust variances” $\hat\sigma(z^j)^2$ of the
corresponding directions. The reason for defining $\hat\mu$ as in (6.76) is that it is better to apply
a coordinate-wise location estimator to the approximately uncorrelated $z^j$ and then
transform back to the X-coordinates than to apply a coordinate-wise location
estimator directly to the $x^j$s.

The procedure can be iterated in the following way. Put $\hat\mu_{(0)} = \hat\mu(X)$ and
$\hat\Sigma_{(0)} = \hat\Sigma(X)$. At iteration k, we have $\hat\mu_{(k)}$ and $\hat\Sigma_{(k)}$, whose computation has required
computing a matrix A, as in (6.75). Call $Z_{(k)}$ the matrix with rows $z_i = A^{-1}x_i$. Then
$\hat\mu_{(k+1)}$ and $\hat\Sigma_{(k+1)}$ are obtained by computing $\hat\Sigma$ and $\hat\mu$ for $Z_{(k)}$ and then expressing
them back in the original coordinate system. More precisely, we define

$$\hat\Sigma_{(k+1)}(X) = A\,\hat\Sigma(Z_{(k)})\,A', \quad \hat\mu_{(k+1)}(X) = A\,\hat\mu(Z_{(k)}). \tag{6.77}$$

The reason for iterating is that the first step works very well when the data have low
correlations; and the $z^j$s are (hopefully) less correlated than the original variables.
The resulting estimator will be called the “orthogonalized Gnanadesikan-Kettenring
estimator” (OGK).
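The six steps and the iteration (6.77) can be sketched in Python as below. This is an illustrative sketch, not a reference implementation: for brevity it assumes the median and the normalized MAD as the univariate $\hat\mu$ and $\hat\sigma$, and the names `ogk_pass` and `ogk` are hypothetical.

```python
import numpy as np

def madn(z):
    """Normalized MAD, standing in for the univariate robust scale (an assumption)."""
    return 1.4826 * np.median(np.abs(z - np.median(z)))

def ogk_pass(X):
    """Steps 1-5 for a data matrix X; returns A = DE, the robust variances,
    the robust locations of the z^j, and the matrix Z."""
    p = X.shape[1]
    d = np.array([madn(X[:, j]) for j in range(p)])      # diagonal of D
    Y = X / d                                            # step 1
    U = np.eye(p)                                        # step 2
    for j in range(p):
        for k in range(j):
            U[j, k] = U[k, j] = 0.25 * (madn(Y[:, j] + Y[:, k]) ** 2
                                        - madn(Y[:, j] - Y[:, k]) ** 2)
    _, E = np.linalg.eigh(U)                             # step 3
    Z = Y @ E                                            # step 4: rows E'y_i
    gamma = np.array([madn(Z[:, j]) ** 2 for j in range(p)])   # step 5
    nu = np.median(Z, axis=0)
    return d[:, None] * E, gamma, nu, Z                  # A = DE as in (6.75)

def ogk(X, n_iter=2):
    """Initial estimate (6.76) followed by n_iter iterations of (6.77)."""
    A_tot = np.eye(X.shape[1])
    Z = X
    for _ in range(n_iter + 1):
        A, gamma, nu, Z = ogk_pass(Z)    # Z now has rows A^{-1} z_i
        A_tot = A_tot @ A                # accumulated transformation back to X
    sigma = (A_tot * gamma) @ A_tot.T    # A Gamma A'
    mu = A_tot @ nu                      # A nu
    return mu, sigma
```

Because the final matrix has the form $A\Gamma A'$ with positive diagonal $\Gamma$ and nonsingular A, it is positive definite by construction, in contrast with the raw pairwise matrix U.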

A final step is convenient both to increase the estimator’s efficiency and to make 
it “more equivariant”. The simplest and fastest option is the reweighting procedure 
in Section 6.4.6. But it is much better to use this estimator as the starting point for 
the iterations of an S-estimator. 

Since a large part of the computing effort is consumed by the univariate
estimators $\hat\mu$ and $\hat\sigma$, they must be fast. The experiments by Maronna and Zamar (2002)
showed that it is desirable that $\hat\mu$ and $\hat\sigma$ be both bias robust and efficient for the
normal distribution in order for $\hat\Sigma$ and $\hat\mu$ to perform satisfactorily. To this end, the
scatter estimator $\hat\sigma$ is defined in a way similar to the $\tau$-scale estimator (5.27), which
is a truncated standard deviation, and the location estimator $\hat\mu$ is a weighted mean.
More precisely, let $\hat\mu_0$ and $\hat\sigma_0$ be the median and MAD. Let W be a weight function
and $\rho$ a $\rho$-function. Let $w_i = W((x_i - \hat\mu_0)/\hat\sigma_0)$ and

$$\hat\mu = \frac{\sum_i x_i w_i}{\sum_i w_i}, \qquad \hat\sigma^2 = \frac{\hat\sigma_0^2}{n}\sum_i \rho\left(\frac{x_i - \hat\mu}{\hat\sigma_0}\right).$$

An adequate balance of robustness and efficiency is obtained with W the bisquare
weight function (2.57) with k = 4.5, and $\rho$ as the bisquare $\rho$ (2.38) with k = 3.

It is shown by Maronna and Zamar (2002) that if the BPs of $\hat\mu$ and $\hat\sigma$ are not less
than $\varepsilon$ then so is the BP of $(\hat\mu, \hat\Sigma)$, as long as the data are not collinear. Simulations in
their paper show that two is an adequate number of iterations (6.77), and that further
iterations do not seem to converge and yield no improvement.

An implementation of the OGK estimator for applications to data mining was 
discussed by Alqallaf et al. (2002), using the quadrant correlation estimator. A reason 
for focusing on the quadrant correlation was the desire to operate on huge datasets 
that are too large to fit in computer memory. A fast bucketing algorithm can be used 
to compute this estimator on “streaming” input data (data read into the computer 
sequentially from a database). The median and MAD estimators were used for robust 
location and scatter because there are algorithms for the approximate computation of 
order statistics from a single pass on large streaming datasets (Manku et al. 1999). 


6.9.2 The Peña-Prieto procedure


The kurtosis of a random variable x is defined as

$$\mathrm{Kurt}(x) = \frac{\mathrm{E}(x - \mathrm{E}x)^4}{\mathrm{SD}(x)^4}.$$

Peña and Prieto (2007) propose an equivariant procedure based on the following
observation. A distribution is called unimodal if its density has a maximum at some
point $x_0$, and is increasing for $x < x_0$ and decreasing for $x > x_0$. Then it can be shown
that the kurtosis is a measure of both heavy-tailedness and unimodality. It follows
that, roughly speaking, for univariate data a small proportion of outliers increases the
kurtosis, since it makes the data tails heavier, and a large proportion decreases the
kurtosis, since it makes the data more bimodal.
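The effect of the contamination proportion on the kurtosis is easy to check numerically. The following Python sketch uses an illustrative setup that is not taken from the text: standard normal data in which a fraction of the observations is shifted by 10.

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis: fourth central moment over squared variance."""
    c = x - x.mean()
    return np.mean(c ** 4) / np.mean(c ** 2) ** 2

def with_outliers(x, eps, shift=10.0):
    """Shift the first eps*n observations by `shift` (illustrative contamination)."""
    y = x.copy()
    m = int(eps * len(x))
    y[:m] += shift
    return y

rng = np.random.default_rng(0)
clean = rng.standard_normal(5000)
k_clean = kurtosis(clean)                       # close to 3 for normal data
k_few = kurtosis(with_outliers(clean, 0.05))    # heavier tail: kurtosis rises
k_many = kurtosis(with_outliers(clean, 0.40))   # bimodal data: kurtosis falls
```

For this mixture the population kurtosis is about 13 with 5% outliers and about 1.3 with 40% outliers, which is why both maximizing and minimizing the kurtosis are useful for finding revealing directions.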

Hence Peña and Prieto look for projections that either maximize or minimize the
kurtosis, and use them in a way similar to the Stahel-Donoho estimator.

At the same time, they point out that the probability of getting “good” directions
(those that detect outliers) by ordinary subsampling, as in Section 6.8.7, can be very
low unless the number of directions is extremely large. For this reason they also
employ a number of random “specific directions”, which are obtained by a kind of
stratified sampling. The intuitive idea is that if one selects two observations at random,
projects the data onto the direction defined by them and then selects the subsample
from the extremes of the projected data, one can increase the probability of
generating “good” directions, because the proportion of good or bad observations in
the extremes is expected to be greater than in the whole sample.

The procedure is complex, but it may be summarized as follows for p-dimensional 
data: 


1. In Stage I of the procedure, two sets of p directions a are found; one corresponding
   to local maxima and the other to local minima of the kurtosis.

2. In Stage II, a number L of directions are generated, each defined by two random
   points; the space is cut into a number of “slices” orthogonal to each direction, and
   a subsample is chosen at random from each slice, which defines a direction.

3. The outlyingness of each data point is measured through (6.49), with the vector a
   ranging over the directions computed in Stages I and II.

4. Points with outlyingness above a given threshold are transitorily deleted, and
   (1)-(2) are iterated on the remaining points until no more deletions take place.

5. The sample mean and covariance matrix of the remaining points are computed.

6. Deleted points whose Mahalanobis distances are below a threshold are again
   included, and steps (4)-(5) are repeated until no more inclusions take place.

The procedure, denoted KSD in what follows, is very fast for high dimensions.
Maronna (2017) found that in certain extreme situations when the contamination rate
is “high” (> 0.2) and the ratio n/p is “low” (< 10), KSD may be unstable and yield
useless values. For this reason he proposed two simple modifications that largely
correct this drawback with only a small increase in computing time.

Although there are no full theoretical results for the breakdown point of KSD,
Maronna and Yohai (2017) show that its asymptotic BP is 0.5 for point-mass
contamination under elliptical distributions, and the simulations by Peña and Prieto (2007)
also suggest that it has a high FBP.

In Section 6.10 we shall employ this procedure as a starting point for the compu- 
tation of MM-estimators and of estimators based on robust scales. 


6.10 Choosing a location/scatter estimator 


So far, we have described several types of robust location/scale estimators, and the 
purpose of this section is to give guidelines for choosing among them, taking into 
account their efficiency, robustness and computing times. Our results are based on an 
extensive simulation study by Maronna and Yohai (2017). 

The first issue is how we measure the performance of an estimator $(\hat\mu, \hat\Sigma)$
given a central model $N_p(\mu_0, \Sigma_0)$. As in Section 5.9.3.2, we shall employ the
Kullback-Leibler divergence, defined in (5.59) (henceforth D for short). It is
straightforward to show that in the normal family, for $\mu$ with known $\Sigma$, we have

$$D(\hat\mu) = (\hat\mu - \mu_0)'\Sigma_0^{-1}(\hat\mu - \mu_0), \tag{6.78}$$

and for $\Sigma$ with known $\mu$ we have

$$D(\hat\Sigma) = \mathrm{trace}(\Sigma_0^{-1}\hat\Sigma) - \log|\Sigma_0^{-1}\hat\Sigma| - p. \tag{6.79}$$

The normal finite-sample efficiency is then defined as

$$\mathrm{eff}(\hat\mu) = \frac{\mathrm{E}D(\bar{x})}{\mathrm{E}D(\hat\mu)}, \qquad \mathrm{eff}(\hat\Sigma) = \frac{\mathrm{E}D(C)}{\mathrm{E}D(\hat\Sigma)}, \tag{6.80}$$

where $\bar{x}$ and C are the sample mean and covariance matrix, computed from samples
of size n.
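The divergences and the efficiency of a location estimator can be computed as in the Python sketch below. It is only an illustration: it takes (6.78) and (6.79) in the forms $D(\hat\mu) = (\hat\mu - \mu_0)'\Sigma_0^{-1}(\hat\mu - \mu_0)$ and $D(\hat\Sigma) = \mathrm{trace}(\Sigma_0^{-1}\hat\Sigma) - \log|\Sigma_0^{-1}\hat\Sigma| - p$, replaces the expectations in (6.80) by Monte Carlo averages, and uses the coordinate-wise median purely as an example estimator. All function names are hypothetical.

```python
import numpy as np

def d_location(mu_hat, mu0, sigma0):
    """Divergence for location with known scatter."""
    diff = mu_hat - mu0
    return float(diff @ np.linalg.solve(sigma0, diff))

def d_scatter(sigma_hat, sigma0):
    """Divergence for scatter with known location."""
    m = np.linalg.solve(sigma0, sigma_hat)
    _, logdet = np.linalg.slogdet(m)
    return float(np.trace(m) - logdet - m.shape[0])

def mc_location_efficiency(estimator, n, p, n_rep=500, seed=0):
    """Monte Carlo version of eff(mu_hat) under N_p(0, I)."""
    rng = np.random.default_rng(seed)
    mu0, sigma0 = np.zeros(p), np.eye(p)
    d_mean = d_est = 0.0
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        d_mean += d_location(X.mean(axis=0), mu0, sigma0)
        d_est += d_location(estimator(X), mu0, sigma0)
    return d_mean / d_est

eff_median = mc_location_efficiency(lambda X: np.median(X, axis=0), n=100, p=3)
```

For the coordinate-wise median the result should be in the neighborhood of the asymptotic value $2/\pi \approx 0.64$.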

The elements to be compared are:

• The estimators: We consider four types of estimators for which the efficiency
  can be controlled without affecting the BP: Rocke, MM, $\tau$ and Stahel-Donoho.
  We add for completeness four other estimators with uncontrollable efficiency: the
  S-estimator (S-E), the MVE, the MCD with one-step reweighting, and Peña and
  Prieto’s “KSD” estimator described in Section 6.9.2. In all cases except for the
  KSD the scales were tuned to attain the maximum FBP (6.27). As explained at the
  end of Section 6.9.2, there is strong evidence to believe that the FBP of KSD is
  also near the maximum one given by (6.27). All scatter estimators were corrected
  for “size” by means of (6.22).

• The starting values: For Rocke, MM, $\tau$ and S-E we compare the MVE and KSD
  estimators as starting values. For Stahel-Donoho we compare the directions
  supplied by KSD with those obtained by subsampling.

• The $\rho$- (or W-) functions: For MM, $\tau$ and S-E we compare the bisquare and SHR
  $\rho$s; for Stahel-Donoho we compare SHR and Huber weights (6.55).


6.10.1 Efficiency 


For all estimators with controllable efficiency, the tuning constants for each estimator
are chosen to make the finite-sample efficiency of $\hat\Sigma$ equal to 0.90. This priority of
$\hat\Sigma$ over $\hat\mu$ is due to the fact that, as explained in Section 6.17.4, robustly estimating
scatter is “more difficult” than location. The simulations showed that in all cases
$\mathrm{eff}(\hat\mu) > \mathrm{eff}(\hat\Sigma)$. A preliminary simulation was performed for each estimator. Its
tuning constants were computed for n = Kp with K = 5, 10 and 20 and p between
5 and 50, and were then fitted as simple functions of n and p. Relevant values are
given in Section 6.10.4.

A simulation was run to estimate (6.80), replacing the expected values ED by
their Monte Carlo averages $\bar{D}$ computed over 1000 samples of size n from $N_p(\mu_0, \Sigma_0)$
with $(\mu_0, \Sigma_0) = (0, I)$; since all estimators are equivariant, the results do not depend
on $(\mu_0, \Sigma_0)$. The simulation showed the efficiency cannot be controlled in all cases:

• For p = 10 the efficiency of the Rocke scatter estimator is 0.73, and is still lower
  for smaller p. The explanation is that when $\alpha$ tends to zero, the estimator does not
  tend to the covariance matrix unless p is large enough.

• The minimum efficiency of the $\tau$-estimators over all constants c tends to 1 with
  increasing p, for both $\rho$-functions. In particular, it is > 0.95 for p > 50. The
  reason is that when c is small, the $\tau$-scale approaches the M-scale, and therefore the
  $\tau$-estimator approaches the S-estimator.

• The simulations under contamination, described in the next section, showed that
  when p < 10, the MM-estimator with efficiency 0.90 is too sensitive to outliers,
  and therefore lower efficiencies were chosen for those cases.


As explained in Sections 6.4.1, 6.4.3 and 6.4.6, the efficiency of the MVE is very 
low, and that of MCD is low unless one accepts a very low BP. 

As to KSD, its efficiency depends on n and p: it is less than 0.5 for n = 5p and 
less than 0.8 for n = 10p. 


6.10.2 Behavior under contamination 


To assess the estimators’ robustness, a simulation was run with contaminated data
from $N_p(0, I)$. For each estimator and scenario, the average $\bar{D}$ of (6.78) and (6.79)
was computed. Given the contamination rate $\varepsilon \in (0, 1)$ let $m = [n\varepsilon]$, where n is the
sample size. The first coordinate $x_{i1}$ of $x_i$ ($i = 1, \ldots, m$) is replaced by $\gamma x_{i1} + K$, where
K is the outliers’ size and the constant $\gamma$ determines the scatter of the outliers. Here
K is varied between 1 and 12 in order to find the maximum $\bar{D}$ for all estimators. The
values chosen for $\varepsilon$ were 0.1 and 0.2, and those for $\gamma$ were 0 and 0.5.
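The contamination scheme amounts to a few lines of code. The Python sketch below uses the hypothetical name `contaminate` and takes $[n\varepsilon]$ as the integer part of $n\varepsilon$.

```python
import numpy as np

def contaminate(X, eps, K, gamma):
    """Replace the first coordinate of the first m = [n*eps] rows by gamma*x + K."""
    Xc = X.copy()
    m = int(np.floor(X.shape[0] * eps))
    Xc[:m, 0] = gamma * Xc[:m, 0] + K
    return Xc

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))              # clean sample from N_p(0, I)
Xc = contaminate(X, eps=0.1, K=6.0, gamma=0.0)
```

With $\gamma = 0$ the outliers form a point mass at K on the first coordinate, while $\gamma = 0.5$ spreads them with half the standard deviation of the clean data.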

The simulations were run for p = 5, 10, 15, 20 and 30, and n = mp with m = 5,
10 and 20. Each estimator was evaluated by its maximum $\bar{D}$. The most important
conclusions are the following:

• The price paid for the high efficiency of S-E is a large loss of robustness.

• KSD is always better than MVE as a starting estimator for MM and $\tau$.

• KSD is generally better than subsampling for Stahel-Donoho.

• The SHR $\rho$ is always better than the bisquare $\rho$ for both MM and $\tau$.

• The SHR weights are better than Huber weights for Stahel-Donoho.

• In all situations, the best estimators are MM and $\tau$ with SHR $\rho$, Rocke, and
  Stahel-Donoho, all starting from KSD.

• Although the results for $\gamma = 0$ and 0.5 are different, the comparisons among
  estimators are almost the same.

• The relative performances of the estimators for location and scatter are similar.

• The relative performances of the estimators for n = 5p, 10p and 20p are similar.


Table 6.3 shows a reduced version of the results, for n = 10p and $\gamma = 0$, of the
maximum $\bar{D}$s of the scatter estimators corresponding to MM and $\tau$ (both with SHR $\rho$),
Rocke and Stahel-Donoho (S-D), all starting from KSD. For completeness, we add
S-E with KSD start, and MCD. The results for estimators with efficiency less than
0.9 are shown in italics.

Table 6.3 Maximum mean Kullback-Leibler divergences of scatter matrices, for
n = 10p and $\gamma = 0$. Italics denote estimators with less than 90% efficiency

   p    $\varepsilon$       MM      $\tau$    Rocke     S-D     S-E     MCD

   5   0.1    0.85    0.89    0.95    0.99    1.09    1.99
       0.2    2.27    2.46    2.61    4.53    4.38   17.58
  10   0.1    1.67    1.77    1.43    1.61    3.54    6.66
       0.2    3.88    4.53    3.48    7.94   11.26   21.89
  15   0.1    2.38    2.98    1.95    2.26    6.68   12.53
       0.2    5.68    7.85    4.47   12.31   19.82   28.33
  20   0.1    3.32    4.59    2.49    3.00   10.03   16.46
       0.2    7.90   12.62    3.17   17.09   25.41   32.04
  30   0.1    5.34    8.56    3.03    4.64   18.39   17.66
       0.2   14.21   20.71    5.61   29.66   49.14   34.02


It is seen that:

• The performance of S-D is competitive for $\varepsilon = 0.1$, but is poor for $\varepsilon = 0.2$.

• For p < 10, MM has the best overall performance.

• For $p \ge 10$, Rocke has the best overall performance.

• The MCD has poor performance.

Figure 6.9 shows the values of $\bar{D}$ as a function of the outlier size K for some of the
estimators in the case p = 20, n = 200 and $\gamma = 0$. Here “MM-SHR” stands for “MM
with SHR $\rho$”. All estimators in the lower panel start from KSD. The plot confirms
the superiority of Rocke with KSD start.


6.10.3 Computing times 


We compare the computing times of the Rocke estimator with MVE and KSD starts.
The results are the average of 20 runs with normal samples, on a PC with an Intel Core 2
Duo CPU and 3.01 GHz clock speed. The values of n were 5p, 10p and 20p, with p
between 20 and 100.

The rational way to choose the number of subsamples $N_{sub}$ for the MVE would
be to ensure a given (probabilistic) breakdown point. But according to (6.64), the
values of $N_{sub}$ that ensure a breakdown point of just 0.15 are 18298 for p = 50 and
$6.18 \times 10^7$ for p = 100; these are impractically large. For this reason, $N_{sub}$ was
chosen to increase more slowly so as to yield feasible computing times, namely as
$N_{sub} = 50p$.
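The order of magnitude of these subsample counts can be reproduced with the usual probabilistic argument. The Python sketch below rests on assumptions that are not stated in this section: that (6.64) is the bound $N \ge |\log\delta| / |\log(1 - (1-\varepsilon)^{p+1})|$ for subsamples of size p + 1, and that the failure probability is $\delta = 0.01$. The function name is hypothetical.

```python
import math

def n_subsamples(p, eps, delta=0.01):
    """Subsamples needed so that, with probability 1 - delta, at least one
    subsample of size p + 1 is free of outliers (assumed form of the bound)."""
    prob_clean = (1.0 - eps) ** (p + 1)
    return math.ceil(math.log(delta) / math.log1p(-prob_clean))

n50 = n_subsamples(50, 0.15)     # roughly 1.8e4
n100 = n_subsamples(100, 0.15)   # roughly 6.2e7
```

Under these assumptions the results agree with the quoted values to within about 1%, and both are far larger than the 2500 and 5000 subsamples given by the rule 50p.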

Table 6.4 displays the results, where for brevity we only show the values for 
p = 20, 50, 80 and 100. It is seen that Rocke+KSD is faster than Rocke+MVE. 
Figure 6.9 Mean D for p = 20, n = 200, $\varepsilon = 0.1$ and $\gamma = 0$ as a function of the outlier size K


6.10.4 Tuning constants 


For the MM-estimator with KSD start and SHR, c is approximated by

$$c = a + \frac{b}{p} + c\,\frac{p}{n},$$

with a = 0.612, b = 4.504 and c = -1.112.
For the Rocke estimator with KSD, the value of $\alpha$ which yields 90% efficiency is
approximated for p > 15 by

$$\alpha = a\,p^{b}\,n^{c}$$

with a = 0.00216, b = -1.0078, c = 0.8156.


6.10.5 Conclusions 


The Rocke estimator has a controllable efficiency for p > 15. With equal efficiencies, 
the Rocke estimator with KSD start outperforms all its competitors. Its computing 
time is competitive for p < 100. Bearing in mind the trade-off between robustness 
and efficiency, we recommend the Rocke estimator for $p \ge 10$ and the MM-estimator
with SHR for p < 10 (in both cases with KSD start).
Table 6.4 Mean computing times of estimators (in seconds)

    p       n    Rocke+MVE    Rocke+KSD

   20     100       0.62         0.09
          200       0.98         0.10
          400       1.31         0.15
   50     250       5.03         0.46
          500       6.54         1.02
         1000      12.72         2.59
   80     400      14.55         4.20
          800      22.46        10.01
         1600      65.45        16.73
  100     500      28.86        27.56
         1000      74.01        55.69
         2000     152.06        65.76


6.11 Robust principal components 


Principal components analysis (PCA) is a widely used method for dimensionality
reduction. Let x be a p-dimensional random vector with mean $\mu$ and covariance
matrix $\Sigma$. The first principal component is the univariate projection of maximum
variance; more precisely, it is the linear combination $x'b_1$, where $b_1$ (called the first
principal direction) is the vector b such that

$$\mathrm{Var}(b'x) = \max \quad \text{subject to } \|b\| = 1. \tag{6.81}$$

The second principal component is $x'b_2$, where $b_2$ (the second principal direction)
satisfies (6.81) with $b_2'b_1 = 0$, and so on. Call $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$ the eigenvalues of
$\Sigma$. Then $b_1, \ldots, b_p$ are the respective eigenvectors and $\mathrm{Var}(b_j'x) = \lambda_j$. The number q
of components can be chosen on the basis of the “proportion of unexplained variance”

$$\frac{\sum_{j=q+1}^{p} \lambda_j}{\sum_{j=1}^{p} \lambda_j}. \tag{6.82}$$

PCA can be viewed in an alternative geometric form in the spirit of regression
modeling. Consider finding a q-dimensional hyperplane H such that the orthogonal
distance of x to H is the “smallest”, in the following sense. Call $\hat{x}_H$ the point of H closest
in Euclidean distance to x; that is, such that

$$\hat{x}_H = \arg\min_{z \in H} \|x - z\|.$$

Then we look for $H^*$ such that

$$\mathrm{E}\,\|x - \hat{x}_{H^*}\|^2 = \min. \tag{6.83}$$

It can be shown (Seber, 1984) that $H^*$ contains the mean $\mu$ and has the directions
of the first q eigenvectors $b_1, \ldots, b_q$, and so $H^*$ is the set of translated linear
combinations of $b_1, \ldots, b_q$:

$$H^* = \left\{\mu + \sum_{k=1}^{q} \alpha_k b_k : \alpha_k \in R\right\}. \tag{6.84}$$

Then

$$z_j = (x - \mu)'b_j \quad (j = 1, \ldots, p) \tag{6.85}$$

are the coordinates of the centered x in the coordinate system of the $b_j$, and

$$d_H = \sum_{j=q+1}^{p} z_j^2 = \|x - \hat{x}_H\|^2, \qquad d_C = \sum_{j=1}^{q} z_j^2 \tag{6.86}$$

are the squared distances from x to H and from $\hat{x}_H$ to $\mu$, respectively.

Note that the results of PCA are not invariant under general affine transforma- 
tions, in particular under changes in the units of the variables. Doing PCA implies 
that we consider the Euclidean norm to be a sensible measure of distance, and this 
may require a previous rescaling of the variables. PCA is, however, invariant under 
orthogonal transformations; that is, transformations that do not change Euclidean 
distances. 

Given a dataset $X = \{x_1, \ldots, x_n\}$, the sample principal components are computed
by replacing $\mu$ and $\Sigma$ by the sample mean and covariance matrix. For each observation
$x_i$, we compute the scores $z_{ij} = (x_i - \bar{x})'b_j$ and the distances

$$d_{H,i} = \sum_{j=q+1}^{p} z_{ij}^2 = \|x_i - \hat{x}_{\hat{H},i}\|^2, \qquad d_{C,i} = \sum_{j=1}^{q} z_{ij}^2, \tag{6.87}$$

where $\hat{H}$ is the estimated hyperplane. A simple data analytic tool, similar to the plot
of residuals versus fitted values in regression, is to plot $d_{H,i}$ against $d_{C,i}$.
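The classical computations of this section can be sketched in Python as follows; the function name `pca_distances` and the simulated data are illustrative only.

```python
import numpy as np

def pca_distances(X, q):
    """Classical PCA: squared distances to the hyperplane (d_H), squared score
    distances within it (d_C), and the proportion of unexplained variance."""
    xbar = X.mean(axis=0)
    lam, B = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(lam)[::-1]            # eigenvalues in decreasing order
    lam, B = lam[order], B[:, order]
    Z = (X - xbar) @ B                       # scores z_ij
    d_C = np.sum(Z[:, :q] ** 2, axis=1)
    d_H = np.sum(Z[:, q:] ** 2, axis=1)
    unexplained = lam[q:].sum() / lam.sum()
    return d_H, d_C, unexplained

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
d_H, d_C, unexplained = pca_distances(X, q=2)
```

Since the eigenvector matrix is orthogonal, the two distances add up to the squared Euclidean distance of each observation to the sample mean.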

As can be expected, outliers may have a distorting effect on the results. For
instance, in Example 6.1, the first principal component of the correlation matrix of
the data explains 75% of the variability, while after deleting the atypical point it
explains 90%. The simplest way to deal with this problem is to replace $\bar{x}$ and Var(X)
with robust estimators $\hat\mu$ and $\hat\Sigma$ of multivariate location and scatter. Campbell (1980)
uses M-estimators. Croux and Haesbroeck (2000) discuss several properties of this
approach. Note that the results depend only on the shape of $\hat\Sigma$ (Section 6.3.2).

However, better results can be obtained by taking advantage of the particular
features of PCA. For affine equivariant estimation, the “natural” metric is that given
by squared Mahalanobis distances $d_i = (x_i - \hat\mu)'\hat\Sigma^{-1}(x_i - \hat\mu)$, which depend on
the data through $\hat\Sigma$, while for PCA we have a fixed metric given by Euclidean
distances. This implies that the concept of outliers changes. In the first case, an
outlier that should be downweighted is a point with a large squared Mahalanobis
distance to the center of the data, while in the second it is a point with a large
Euclidean distance to the hyperplane (“large” as compared to the majority of points).
For instance, consider two independent variables with zero means and standard
deviations 10 and 1, and q = 1. The first principal component corresponds to the
first coordinate axis. Two data values, one at (100, 1) and one at (10, 10), have
identical large squared Mahalanobis distances of 101, but their Euclidean distances
to the first axis are 1 and 10 respectively. The second one would be harmful to
the estimation of the principal components, but the first one is a “good” point for
that purpose.
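The arithmetic of this example can be verified directly; the short Python check below uses the true covariance matrix diag(100, 1) and zero means.

```python
import numpy as np

sigma = np.diag([10.0 ** 2, 1.0 ** 2])           # standard deviations 10 and 1
points = np.array([[100.0, 1.0], [10.0, 10.0]])

# Squared Mahalanobis distances x' Sigma^{-1} x for each point
maha2 = np.einsum("ij,jk,ik->i", points, np.linalg.inv(sigma), points)

# Euclidean distance to the first coordinate axis (the hyperplane for q = 1)
dist_to_axis = np.abs(points[:, 1])
```

Both points give $100^2/100 + 1^2 = 10^2/100 + 10^2 = 101$, while their distances to the first axis are 1 and 10.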

Boente (1983, 1987) studied M-estimators for PCA. An alternative approach to 
robust PCA is to replace the variance in (6.81) by a robust scale. This approach was 
first proposed by Li and Chen (1985), who found serious computational problems. 
Croux and Ruiz-Gazen (1996) proposed an approximation based on a finite number 
of directions. Hubert et al. (2012) propose a method that combines projection pursuit 
ideas with robust scatter matrix estimation. 

The next sections describe two of our preferred approaches to robust PCA. One 
is based on robust fitting of a hyperplane H by minimization of a robust scale, while 
the other is a simple and fast “spherical” principal components method that works 
well for large datasets. 


6.11.1 Spherical principal components 


In this section we describe a simple but effective approach proposed by Locantore
et al. (1999). Let x have an elliptical distribution (6.7), in which case if Var(x) exists
it is a constant multiple of $\Sigma$. Let $y = (x - \mu)/\|x - \mu\|$; that is, y is the normalization
of x to the surface of the unit sphere centered at $\mu$. Boente and Fraiman (1999) showed
that the eigenvectors $t_1, \ldots, t_p$ (but not the eigenvalues!) of the covariance matrix of y
(that is, its principal axes) coincide with those of $\Sigma$. They showed furthermore that if
$\sigma(.)$ is any scatter statistic, then the values $\sigma(x't_j)^2$ are proportional to the eigenvalues
of $\Sigma$. Proofs are given in Section 6.17.11.

This result is the basis for a simple robust approach to PCA, called spherical
principal components (SPC). Let $\hat\mu$ be a robust multivariate location estimator, and
compute

$$y_i = \begin{cases} (x_i - \hat\mu)/\|x_i - \hat\mu\| & \text{if } x_i \neq \hat\mu \\ 0 & \text{otherwise.} \end{cases}$$

Let $\hat{V}$ be the sample covariance matrix of the $y_i$s with corresponding eigenvectors $b_j$
($j = 1, \ldots, p$). Now compute $\hat\lambda_j = \hat\sigma(x'b_j)^2$, where $\hat\sigma$ is a robust scatter estimator (such
as the MAD). Call $\hat\lambda_{(j)}$ the sorted $\hat\lambda$s, $\hat\lambda_{(1)} \ge \ldots \ge \hat\lambda_{(p)}$, and $b_{(j)}$ the corresponding
eigenvectors. Then the first q principal directions are given by the $b_{(j)}$s, $j = 1, \ldots, q$,
and the respective “proportion of unexplained variance” is given by (6.82), where $\lambda_j$
is replaced by $\hat\lambda_{(j)}$.

In order for the resulting robust PCA to be invariant under orthogonal
transformations of the data, it is not necessary that $\hat\mu$ be affine equivariant, but only orthogonal
equivariant; that is, such that $\hat\mu(TX) = T\hat\mu(X)$ for all orthogonal T. The simplest
choice for $\hat\mu$ is the “space median”:

$$\hat\mu = \arg\min_{\mu} \sum_{i=1}^{n} \|x_i - \mu\|.$$

Note that this is an M-estimator since it corresponds to (6.9) with $\rho(t) = \sqrt{t}$ and
$\Sigma = I$. Thus the estimator can be easily computed through the first equation in (6.58),
with $W_1(d) = 1/\sqrt{d}$, and starting with the coordinate-wise medians. It follows from
Section 6.17.4.1 that this estimator has BP = 0.5.

This procedure is deterministic and very fast, and it can be computed with 
collinear data without any special adjustments. Despite its simplicity, simulations by 
Maronna (2005) show that this SPC method performs reasonably well. 
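A compact Python sketch of the SPC procedure is given below. It is an illustration under stated assumptions: the space median is computed by the iterative reweighting just described (weights $1/\|x_i - \mu\|$, with a small guard against zero distances), the normalized MAD plays the role of $\hat\sigma$, and the function names are hypothetical.

```python
import numpy as np

def madn(z):
    """Normalized MAD, used as the robust scale (an assumption)."""
    return 1.4826 * np.median(np.abs(z - np.median(z)))

def space_median(X, n_iter=200, tol=1e-9):
    """Iteratively reweighted mean with weights 1/||x_i - mu||."""
    mu = np.median(X, axis=0)                      # coordinate-wise medians as start
    for _ in range(n_iter):
        dist = np.linalg.norm(X - mu, axis=1)
        w = 1.0 / np.maximum(dist, 1e-12)          # guard against zero distances
        new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(new - mu) < tol:
            return new
        mu = new
    return mu

def spherical_pca(X):
    """Directions from the sphered data, eigenvalues from a robust scale."""
    mu = space_median(X)
    R = X - mu
    norms = np.linalg.norm(R, axis=1)
    Y = np.where(norms[:, None] > 0, R / np.maximum(norms, 1e-300)[:, None], 0.0)
    _, B = np.linalg.eigh(np.cov(Y, rowvar=False))
    lam = np.array([madn(X @ B[:, j]) ** 2 for j in range(X.shape[1])])
    order = np.argsort(lam)[::-1]                  # sort by robust eigenvalues
    return mu, lam[order], B[:, order]
```

On data whose main variability lies along the first axis but which contain a cluster of distant outliers along the second axis, the first SPC direction stays close to the first axis, whereas the leading eigenvector of the classical covariance matrix is pulled toward the outliers.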


6.11.2 Robust PCA based on a robust scale 


The proposed approach is based on replacing the expectation in (6.83) by a
robust M-scale (Maronna, 2005). For a given q-dimensional hyperplane H, call
$\hat\sigma(H)$ an M-scale estimator of the $d_{H,i}$ in (6.87); that is, $\hat\sigma(H)$ satisfies the M-scale
equation

$$\frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{\|x_i - \hat{x}_{H,i}\|^2}{\hat\sigma(H)}\right) = \delta. \tag{6.88}$$

Then we search for H having the form of the right-hand side of (6.84), such that $\hat\sigma(H)$
is minimum. For a given H let

$$\hat\mu = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i x_i, \qquad \hat{V} = \sum_{i=1}^{n} w_i (x_i - \hat\mu)(x_i - \hat\mu)' \tag{6.89}$$

with

$$w_i = W\left(\frac{\|x_i - \hat{x}_{H,i}\|^2}{\hat\sigma}\right), \tag{6.90}$$

where $W = \rho'$. It can be shown by differentiating (6.88) with respect to $\mu$ and $b_j$ that
the optimal H has the form (6.84), where $\mu = \hat\mu$ and $b_1, \ldots, b_q$ are the eigenvectors of
$\hat{V}$ corresponding to its q largest eigenvalues. In other words, the hyperplane is defined
by a weighted mean of the data and the principal directions of a weighted covariance
matrix, where points distant from H receive small weights.
This result suggests an iterative procedure, in the spirit of the iterative
reweighting approach of Sections 6.8.1 and 6.8.2. Starting with some initial $\hat{H}_0$, compute
the weights with (6.90), then compute $\hat\mu$ and $\hat{V}$ with (6.89) and the corresponding
principal components, which yield a new $\hat{H}$. It follows from the results by Boente
(1983) that if W is nonincreasing, then $\hat\sigma$ decreases at each step of this procedure
and the method converges to a local minimum.

Since this estimator minimizes an M-scale, it will henceforth be denoted by M-S 
for brevity. 

There remains the problem of starting values. Simulations in Maronna (2005) 
suggest that the SPCs described in Section 6.11.1 are a fast and reliable starting point, 
and for this reason this will be our chosen initial procedure. 

A similar estimator can be based on an L-scale. For h < n, compute the L-scale
estimator

$$\hat\sigma(H) = \sum_{i=1}^{h} d_{(i)},$$

where the $d_{(i)}$s are the ordered values of $\|x_i - \hat{x}_{H,i}\|^2$. Then the hyperplane
H minimizing $\hat\sigma(H)$ corresponds to the principal components of (6.89) with
$w_i = \mathrm{I}(d_{H,i} \le d_{(h)})$. This amounts to “trimmed” principal components, in which
only the $x_i$s with the h smallest values of $d_{H,i}$ are retained and the remaining n - h
are trimmed. The analogous iterative procedure converges to a local minimum. The
results obtained in the simulations in Maronna (2005) for the L-scale are not as good
as those corresponding to the M-scale.


Example 6.4 The following dataset from Hettich and Bay (1999) corresponds to 
a study in automatic vehicle recognition (Siebert, 1987). Each of the 218 rows cor- 
responds to a view of a bus silhouette, and contains 18 attributes of the image. The 
SDs are in general much larger than the respective MADNs. The latter vary between
0 (for variable 9) and 34. Hence it was decided to exclude variable 9 and divide the
remaining variables by their MADNs. The tables and figures for this example are 
obtained with script bus.R. 


Table 6.5 shows the proportions of unexplained variability (6.82) as a function of the 
number q of components, for the classical PCA and for M-S. 

It would seem that since the classical method has smaller unexplained variability
than the robust method, classical PCAs give a better representation.
However, this is not the case. Table 6.6 gives the quantiles of the distances
$d_i$ in (6.87) for q = 3, and Figure 6.10 compares the logs of the respective
ordered values (the log scale was used because of the extremely large
outliers).

It is seen in the figure that the hyperplane from the robust fit has in general smaller
distances to the data points, except for some clear outliers. On the other hand, in
Table 6.5 the classical estimator seems to perform better than the robust one. The
reason is that the two estimators use different measures of variability. The classical
procedure uses variances, which are influenced by the outliers, and so large outliers
in the direction of the first principal axes will inflate the corresponding variances and
hence increase their proportion of explained variability. On the other hand the robust
estimator uses a robust measure, which is free of this drawback, and gives a more
accurate measure of the unexplained variability for the bulk of the data.

Table 6.5 Bus data: proportion of unexplained variability
for q components

q    Classical   M-S
1    0.188       0.451
2    0.083       0.182
3    0.044       0.114
4    0.026       0.081
5    0.018       0.054
6    0.012       0.039

Table 6.6 Bus data: quantiles of distances to hyperplane

            0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    Max
Classical   1.86   2.28   2.78   3.23   3.74   4.37   5.45   6.47   8.17   23
Robust      1.20   1.59   1.90   2.17   2.59   2.92   3.47   4.33   5.48   1055

Figure 6.10 Distances to the hyperplane from classic and robust estimators, in
log-scale. The straight line is the identity
6.12 Estimation of multivariate scatter and location
with missing data


A frequent problem is that some of the cells of the n × p data matrix $\mathbf{X} = [x_{ij}]$ are
missing. The robust estimation of the scatter matrix and multivariate location in this
case is a challenging problem. In the case of multivariate normal data, the maximum
likelihood estimator can be computed by means of the expectation-maximization
(EM) algorithm; see Dempster et al. (1977) and Little and Rubin (2002). However,
as generally happens with maximum likelihood estimators for normal data, these
estimators are not robust.

Throughout this section it is assumed that the cells are missing completely at random;
that is, the np events {$x_{ij}$ is missing} are independent. It is desirable to have
estimators that are both consistent for normal data and highly robust. Robust estimators
for this case have been proposed by Little and Smith (1987), Little (1988) and
Frahm and Jaekel (2010). However, all of these have low BPs. The one proposed by
Cheng and Victoria-Feser (2002) is not consistent for normal data and is sensitive
towards clustered outliers.

Danilov et al. (2012) proposed generalized S (GS) estimators for estimating scat- 
ter and multivariate location. These estimators are defined by the minimization of 
a weighted M-scale of Mahalanobis distances. The Mahalanobis distances for each 
observation depend only on the non-missing components and are based on a marginal 
covariance matrix standardized so that its determinant is one. To compute these
estimators, the authors proposed a weighted EM-type algorithm (Dempster et al., 1977)
where the weights are based on a redescending score function. They prove that under
general conditions these estimators are shape-consistent for elliptical data and shape-
and size-consistent for normal data. We now give a more detailed description of this
approach.


6.12.1 Notation 


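Let $\mathbf{x}_i = (x_{i1},\dots,x_{ip})'$, $1 \le i \le n$, be p-dimensional i.i.d. random vectors. Let
$\mathbf{u}_i = (u_{i1},\dots,u_{ip})'$, $1 \le i \le n$, be independent p-dimensional vectors of zeros and
ones, where $u_{ij} = 1$ if $x_{ij}$ is observed and $u_{ij} = 0$ if it is missing. We also assume
that $\mathbf{u}_i$ and $\mathbf{x}_i$ are independent (which corresponds to the "missing at random"
assumption).

Given $\mathbf{x} = (x_1,\dots,x_p)'$ and a vector of zeros and ones $\mathbf{u} = (u_1,\dots,u_p)'$, let $\mathbf{x}^{(\mathbf{u})}$ be
the vector whose components are all the $x_j$s such that $u_j = 1$, and let

$$p(\mathbf{u}) = \sum_{j=1}^{p} u_j. \tag{6.91}$$

In other words, $\mathbf{x}^{(\mathbf{u})}$ is a vector of dimension $p(\mathbf{u})$ formed with the available entries of
$\mathbf{x}$. We assume the following identifiability condition: given $1 \le j < k \le p$, there exists
at least one $\mathbf{u}_i$, $1 \le i \le n$, with $u_{ij} = u_{ik} = 1$. Let $\mathcal{U}_p = \{\mathbf{u} : \mathbf{u} = (u_1,\dots,u_p)',\ u_j \in \{0,1\}\}$.
Then, given a p × p positive definite matrix $\boldsymbol{\Sigma}$ and $\mathbf{u} \in \mathcal{U}_p$, we denote by $\boldsymbol{\Sigma}^{(\mathbf{u})}$ the
submatrix of $\boldsymbol{\Sigma}$ formed with the rows and columns corresponding to $u_j = 1$. Finally,
we set $\boldsymbol{\Sigma}^{*(\mathbf{u})} = \boldsymbol{\Sigma}^{(\mathbf{u})} / |\boldsymbol{\Sigma}^{(\mathbf{u})}|^{1/p(\mathbf{u})}$, so that $|\boldsymbol{\Sigma}^{*(\mathbf{u})}| = 1$.

Given a data point $(\mathbf{x}, \mathbf{u})$, a center $\mathbf{m} \in R^p$ and a p × p positive definite scatter
matrix $\boldsymbol{\Sigma}$, the partial squared Mahalanobis distance is given by

$$d(\mathbf{x}, \mathbf{u}, \mathbf{m}, \boldsymbol{\Sigma}) = (\mathbf{x}^{(\mathbf{u})} - \mathbf{m}^{(\mathbf{u})})' (\boldsymbol{\Sigma}^{(\mathbf{u})})^{-1} (\mathbf{x}^{(\mathbf{u})} - \mathbf{m}^{(\mathbf{u})}). \tag{6.92}$$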
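As an illustration of (6.92), the following Python sketch computes the partial squared Mahalanobis distance for one observation using only the standard library. The helper names and the numerical values are ours and purely illustrative; a production implementation would rely on a numerical linear algebra library.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]     # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                        # back substitution
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def partial_mahalanobis(x, u, m, sigma):
    """Squared Mahalanobis distance using only the cells with u_j = 1."""
    obs = [j for j, flag in enumerate(u) if flag == 1]
    if not obs:
        raise ValueError("at least one cell must be observed")
    r = [x[j] - m[j] for j in obs]                        # x^(u) - m^(u)
    sub = [[sigma[j][k] for k in obs] for j in obs]       # Sigma^(u)
    z = solve(sub, r)                                     # (Sigma^(u))^{-1} r
    return sum(ri * zi for ri, zi in zip(r, z))


# Hypothetical example: the second cell is missing (u_2 = 0).
x = [1.0, None, 3.0]
u = [1, 0, 1]
m = [0.0, 0.0, 0.0]
sigma = [[2.0, 0.5, 0.0],
         [0.5, 1.0, 0.3],
         [0.0, 0.3, 4.0]]
d = partial_mahalanobis(x, u, m, sigma)   # uses rows/columns 1 and 3 only
```

Here the submatrix for the observed cells is diagonal with entries 2 and 4, so the distance reduces to 1²/2 + 3²/4 = 2.75.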


6.12.2 GS estimators for missing data 


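It is easy to show that the Davies S-estimators for complete data introduced
in Section 6.4.2 can also be defined as follows. Suppose that n > 2p and let
$\rho : R_+ \to R_+$ be a ρ-function. Given $\mathbf{m} \in R^p$ and a p × p positive definite matrix $\boldsymbol{\Sigma}$,
let $S_n(\mathbf{m}, \boldsymbol{\Sigma})$ be the solution in s to the equation

$$\frac{1}{n} \sum_{i=1}^{n} \rho\left( \frac{d(\mathbf{x}_i, \mathbf{m}, \boldsymbol{\Sigma})}{c_p\, s} \right) = 0.5,$$

where $c_p$ is such that

$$E\left( \rho\left( \frac{\|\mathbf{X}\|^2}{c_p} \right) \right) = 0.5, \tag{6.93}$$

where $\mathbf{X}$ has an elliptical density $f_0$ centered at 0 and scatter matrix equal to the
identity. Usually $f_0$ is chosen such that $f_0(\|\mathbf{x}\|)$ is the standard multivariate normal
density. Then the S-estimator $(\widehat{\mathbf{m}}_n, \widehat{\boldsymbol{\Sigma}}_n)$ is given by

$$(\widehat{\mathbf{m}}_n, \widetilde{\boldsymbol{\Sigma}}_n) = \arg\min_{\mathbf{m},\, |\boldsymbol{\Sigma}| = 1} S_n(\mathbf{m}, \boldsymbol{\Sigma}),$$
$$\widehat{s}_n = S_n(\widehat{\mathbf{m}}_n, \widetilde{\boldsymbol{\Sigma}}_n), \tag{6.94}$$
$$\widehat{\boldsymbol{\Sigma}}_n = \widehat{s}_n \widetilde{\boldsymbol{\Sigma}}_n.$$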


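We now generalize the definition of S-estimators for the case of incomplete data. Let
$\boldsymbol{\Omega}_n$ be a p × p positive definite initial estimator for $\boldsymbol{\Sigma}_0$. Given $\mathbf{m} \in R^p$ and a p × p
positive definite matrix $\boldsymbol{\Sigma}$, let $S_n^*(\mathbf{m}, \boldsymbol{\Sigma})$ be defined as the solution in s of

$$\sum_{i=1}^{n} c_{p(\mathbf{u}_i)}\, \rho\left( \frac{d\big(\mathbf{x}_i^{(\mathbf{u}_i)}, \mathbf{m}^{(\mathbf{u}_i)}, \boldsymbol{\Sigma}^{*(\mathbf{u}_i)}\big)}{s\, \big|\boldsymbol{\Omega}_n^{(\mathbf{u}_i)}\big|^{1/p(\mathbf{u}_i)}\, c_{p(\mathbf{u}_i)}} \right) = \frac{1}{2} \sum_{i=1}^{n} c_{p(\mathbf{u}_i)}, \tag{6.95}$$

where $c_p$ is defined in (6.93) and $p(\mathbf{u})$ in (6.91). First, location and scatter shape
component estimators $(\widehat{\mathbf{m}}_n, \widetilde{\boldsymbol{\Sigma}}_n)$ are defined by

$$(\widehat{\mathbf{m}}_n, \widetilde{\boldsymbol{\Sigma}}_n) = \arg\min_{\mathbf{m},\, \boldsymbol{\Sigma}} S_n^*(\mathbf{m}, \boldsymbol{\Sigma}). \tag{6.96}$$

Note that $S_n^*(\mathbf{m}, \lambda \boldsymbol{\Sigma}) = S_n^*(\mathbf{m}, \boldsymbol{\Sigma})$ for all $\lambda > 0$, and therefore $\widetilde{\boldsymbol{\Sigma}}_n$ only estimates the
shape component of $\boldsymbol{\Sigma}_0$. The GS-estimator of scatter for $\boldsymbol{\Sigma}_0$ (shape and size) is

$$\widehat{\boldsymbol{\Sigma}}_n = \widehat{s}_n \widetilde{\boldsymbol{\Sigma}}_n, \tag{6.97}$$

with $\widehat{s}_n$ defined by

$$\sum_{i=1}^{n} c_{p(\mathbf{u}_i)}\, \rho\left( \frac{d\big(\mathbf{x}_i^{(\mathbf{u}_i)}, \widehat{\mathbf{m}}_n^{(\mathbf{u}_i)}, \widetilde{\boldsymbol{\Sigma}}_n^{(\mathbf{u}_i)}\big)}{c_{p(\mathbf{u}_i)}\, \widehat{s}_n} \right) = \frac{1}{2} \sum_{i=1}^{n} c_{p(\mathbf{u}_i)}. \tag{6.98}$$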

The estimating equations and the iterative algorithm for computing the GS esti- 
mator are given in Sections 6.17.12 and 6.17.12.1 respectively. 

The GS-estimator is available in the package GSE for R. This package includes 
the function EM that computes the EM-algorithm for Gaussian data, and the functions 
GSE and GRE that compute the GS estimators with Tukey bisquare and Rocke-type 
loss functions respectively (see Leung et al., 2017). 


Example 6.5 We consider again the wine dataset described in Example 6.2. We 
eliminate cells at random from this dataset with probability 0.2. The total number of 
eliminated cells is 125, which represents approximately 17% of the total. The tables 
and figures for this example are obtained with script winel.R. 


In Figure 6.11 we compare the adjusted squared Mahalanobis distances obtained
with the EM estimator for Gaussian data and the GS-estimator based on a bisquare
ρ-function. Since there are missing data, the squared Mahalanobis distances under
normality follow $\chi^2$ distributions with different degrees of freedom. Therefore, in
order to make fair comparisons, they are adjusted as follows. Suppose that d is the
partial squared Mahalanobis distance from a row where only q variables out of p are
observed; that is, p − q are missing. Then the adjusted squared Mahalanobis distance
(referred to as the "adjusted distance") is defined by $d^* = (\chi^2_p)^{-1}(\chi^2_q(d))$, where $\chi^2_p$ is
the $\chi^2$ distribution function with p degrees of freedom. For each estimator we show
the plot of the adjusted distances against the observation index and a QQ-plot of the
adjusted distances. We observe that while the Gaussian procedure does not detect any
outlier, the robust procedure detects seven outliers. An observation is considered as
an outlier if its adjusted distance is larger than $(\chi^2_p)^{-1}(0.999)$.
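The adjustment maps a distance computed from q observed variables onto the scale of a chi-squared distribution with p degrees of freedom, by matching probabilities. The following Python sketch shows the idea using a series expansion for the chi-squared distribution function and bisection for its inverse. It is a rough standard-library illustration with hypothetical function names, adequate for moderate arguments only (the series overflows for distances above roughly 1400); it is not the routine used by the scripts of the book.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF via the series of the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-17 * total and n < 100000:
        n += 1
        term *= z / (a + n)
        total += term
    log_prefactor = a * math.log(z) - z - math.lgamma(a)
    return min(1.0, math.exp(log_prefactor) * total)


def chi2_ppf(prob, df):
    """Inverse chi-squared CDF by bisection (0 <= prob < 1)."""
    if not 0.0 <= prob < 1.0:
        raise ValueError("prob must lie in [0, 1)")
    lo, hi = 0.0, max(1.0, float(df))
    while chi2_cdf(hi, df) < prob:        # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def adjusted_distance(d, q, p):
    """d* = inverse chi2_p CDF evaluated at chi2_q CDF of d."""
    prob = chi2_cdf(d, q)
    if prob >= 1.0:                       # beyond double precision: extreme outlier
        return math.inf
    return chi2_ppf(prob, p)


# Hypothetical values: a row with q = 3 observed variables out of p = 5.
d_star = adjusted_distance(4.0, q=3, p=5)
cutoff = chi2_ppf(0.999, 5)               # outlier threshold on the adjusted scale
```

When no cell is missing (q = p) the adjustment leaves the distance unchanged, which gives a simple sanity check of the code.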


6.13 Robust estimators under the cellwise 
contamination model 


Figure 6.11 Wine example: comparison of adjusted squared Mahalanobis distances
of the Gaussian EM-estimator (above) and the GS-estimator (below). In the left-hand
column the adjusted distances are plotted against the index number and in the
right-hand column their QQ plots are shown

The model for contamination considered up to now consists of replacing a proportion
of rows of the data matrix X with outliers ("casewise contamination").
Alqallaf et al. (2009) proposed a different model of outlier contamination. Instead
of contaminating a proportion of rows (cases), each cell of X is contaminated
independently with a given probability ε. This is called cellwise (or independent)
contamination. Then, even if ε is small, as p increases, the probability that at
least one cell of a row is an outlier tends to 1. In fact, this probability is equal to
$\eta(\varepsilon, p) = 1 - (1 - \varepsilon)^p$. Therefore, since the affine equivariant estimators already
presented have BP of at most 0.5, they are not robust against this type of contamination.
An upper bound $\varepsilon^*$ for the cellwise contamination BP can be obtained by
solving $\eta(\varepsilon^*, p) = 0.5$. The values of $\varepsilon^*$ are shown in Table 6.7.


Table 6.7 Minimal fraction of cellwise contamination that
causes breakdown of an affine equivariant estimator

Dimension   1      2      3      4      5      10     15     20     100
ε*          0.50   0.29   0.21   0.16   0.13   0.07   0.05   0.03   0.01
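Solving 1 − (1 − ε*)^p = 0.5 for ε* gives the closed form ε* = 1 − 0.5^(1/p). The short Python sketch below evaluates this expression for the dimensions of Table 6.7; the function names are ours.

```python
def row_contamination_prob(eps, p):
    """Probability that a row of p cells has at least one contaminated cell."""
    return 1.0 - (1.0 - eps) ** p


def cellwise_bp_bound(p, target=0.5):
    """Cell contamination fraction at which a fraction `target` of rows is affected."""
    return 1.0 - (1.0 - target) ** (1.0 / p)


dims = [1, 2, 3, 4, 5, 10, 15, 20, 100]
bounds = {p: round(cellwise_bp_bound(p), 2) for p in dims}
for p in dims:
    print(p, bounds[p])
```

The rounded values agree with the entries of Table 6.7; for example, with p = 10 a cell contamination rate of about 7% already affects half of the rows on average.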




To deal with this type of outlier, Alqallaf et al. (2009) replace the original
sample with pseudo observations obtained as follows. Let $\widehat{\mu}_j$ and $\widehat{\sigma}_j$, $1 \le j \le p$, be
robust estimators of location and scale of the jth variable. Then the matrix of pseudo
observations $\mathbf{X}^* = (x^*_{ij})$ is defined by

$$x^*_{ij} = \widehat{\sigma}_j\, \psi\left( \frac{x_{ij} - \widehat{\mu}_j}{\widehat{\sigma}_j} \right) + \widehat{\mu}_j,$$

where ψ is a ψ-function, such as the Huber function $\psi(x) = \max(\min(x, c), -c)$,
where c is a conveniently chosen constant. They then estimate the scatter matrix
and location vector using the sample covariance and sample mean of $\mathbf{X}^*$. These
estimators exhibit good behavior under the cellwise contamination model, but they
may be much affected by case outliers.
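The construction of the pseudo observations can be sketched for a single column as follows. The text does not fix the robust location and scale estimators; here we assume, purely for illustration, the median and the normalized MAD (MADN), and a hypothetical tuning constant c = 2.

```python
import statistics


def madn(values):
    """Normalized median absolute deviation about the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values]) / 0.6745


def huber_psi(x, c):
    """Huber psi-function: the identity on [-c, c], constant outside."""
    return max(min(x, c), -c)


def pseudo_observations(column, c=2.0):
    """Pull cells of one variable back towards its robust center."""
    mu = statistics.median(column)        # assumed robust location
    sigma = madn(column)                  # assumed robust scale
    if sigma == 0:
        raise ValueError("robust scale is zero; the column cannot be standardized")
    return [sigma * huber_psi((x - mu) / sigma, c) + mu for x in column]


# Hypothetical column with one outlying cell (55.0).
column = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]
pseudo = pseudo_observations(column, c=2.0)
```

Cells within c robust standard deviations of the center are left unchanged, while the outlying cell is moved to the boundary value $\widehat{\mu}_j + c\,\widehat{\sigma}_j$. Applying this to every column and then computing the sample mean and covariance of the result gives the estimators described above.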

Danilov (2010) proposed treating outlying cells as if they were missing observations,
and then applying a casewise robust procedure for missing data to the modified
dataset. Farcomeni (2014) developed a related robust procedure for clustering, called
snipping. When the number of clusters is taken as one, this procedure gives robust
estimators of multivariate location and scatter. The idea is to treat a given fraction α
of cell values as if they were missing. Once the αnp "missing" cells are fixed,
the location and covariance estimators are obtained by maximizing the likelihood
under normality. Finally, the αnp "missing" cells are chosen in order to maximize the
likelihood. Therefore, computing these estimators requires solving a combinatorial
optimization problem of high computational complexity. For this reason, an exact
solution is not feasible and the authors give only approximate algorithms.
Agostinelli et al. (2015) simulated these approximate snipping estimators and found
that they work quite well for independent outlier contamination but they are not
sufficiently robust under case contamination.

Agostinelli et al. (2015) proposed a two-step procedure for the case of inde- 
pendently contaminated data. In the first step, the filter to detect univariate outliers 
introduced in Gervini and Yohai (2002) and described in Section 5.9.2 is applied to 
each column of X. In the second step, all detected outliers are considered as miss- 
ing and the GS-estimator for missing data described in Section 6.12.2 is applied to 
the resulting incomplete dataset. This procedure is called the two steps GS (TSGS) 
estimator. It has been proved that the TSGS-estimator is consistent for normal data. 

Leung et al. (2017) gave a new version of the TSGS-estimators that performs 
better for casewise outliers when p > 10 and for cellwise outliers when the data are 
highly correlated. This is achieved using a new filter to detect outliers in the first step. 
This new filter is a combination of a consistent bivariate filter proposed by Leung et al. 
(2017) and another filter called DDC proposed by Rousseeuw and van den Bossche 
(2016). It is available in the GSE package mentioned in Section 6.12.2 and can be 
based on the bisquare or Rocke ρ-functions.

Cellwise contamination also affects the behavior of robust PCA. Candés (2011)
proposes a method that is robust towards independent contamination and is very fast
for high dimensions, but its BP for row-wise contamination is zero. Maronna and
Yohai (2008) propose an approach that is robust for both types of contamination, but
not fast enough for large p. The problem of combining resistance to cellwise outliers,
resistance to casewise outliers, and high computational speed remains a difficult one.

Figure 6.12 Wine example: comparison of the squared Mahalanobis distances of
the MM- (above) and the TSGS-estimators (below); left column: the distances plotted
against index number; right, QQ plots


Example 6.6 In this example we compare the results obtained with the
MM-estimator (see Section 6.5) and TSGS-estimators for the wine dataset
employed in Example 6.5. The tables and figures for this example are obtained with
script winel.R.


The results are shown in Figure 6.12. For each estimator, we plot the squared
Mahalanobis distances against the observation index and the corresponding QQ-plot.
We observe that while the MM-estimator detects seven outliers, the TSGS-estimator
detects ten. The cutoff point for a squared Mahalanobis distance to indicate an
outlier is again $(\chi^2_p)^{-1}(0.999)$.


6.14 Regularized robust estimators of the inverse of 
the covariance matrix 


Many applications require estimating the precision matrix $\boldsymbol{\Sigma}^{-1}$, where $\boldsymbol{\Sigma}$ is the covariance
matrix. This occurs, for example, when computing Mahalanobis distances, for
discrimination procedures, in graphical models (see, for example, Lauritzen, 1996)
and in mean-variance portfolio optimization (Scherer, 2015). Suppose that the rows
of the matrix X are i.i.d. observations of a multivariate normal distribution. In this
case, the maximum likelihood estimator of the covariance matrix $\boldsymbol{\Sigma}$ is the sample
covariance S. When p/n is close to one, S is nearly singular, and if p > n it becomes
singular, and therefore in these situations it is not possible to obtain a stable estimator
of $\boldsymbol{\Sigma}^{-1}$. Several approaches have been proposed to overcome this problem: the
quadratic approximation method for sparse inverse covariance learning (Hsieh et al.,
2011), the constrained $L_1$-minimization for inverse matrix estimation (CLIME) (Cai
et al., 2011) and the constrained maximum likelihood estimator (CMLE) (Banerjee
et al., 2008; Yuan and Lin, 2007; Friedman et al., 2008). The CMLE of the precision
matrix $\boldsymbol{\Sigma}^{-1}$ is defined as the matrix $\widehat{\mathbf{T}} = (\widehat{t}_{ij})$, given by

$$\widehat{\mathbf{T}} = \arg\max_{\mathbf{T} \succ 0} \left\{ \log(\det(\mathbf{T})) - \mathrm{tr}(\mathbf{S}\mathbf{T}) - \lambda \sum_{i=1}^{p} \sum_{j=1}^{p} |t_{ij}| \right\},$$

where λ is a fixed given number, and $\mathbf{A} \succ 0$ denotes that $\mathbf{A}$ is positive definite.
Note that maximizing $\log(\det(\mathbf{T})) - \mathrm{tr}(\mathbf{S}\mathbf{T})$ is equivalent to maximizing the normal
likelihood, and this occurs when $\mathbf{T} = \mathbf{S}^{-1}$. Therefore the CMLE estimator is, as in
the case of the regression LASSO, a penalized maximum likelihood estimator. The
penalization term $\lambda \sum_{i=1}^{p} \sum_{j=1}^{p} |t_{ij}|$ prevents $\mathbf{T}$ from having large cell values. The
penalty parameter λ can be chosen by maximizing the likelihood function, computed
by means of cross-validation. An efficient algorithm to compute the CMLE, called
graphical lasso (GLASSO), was proposed by Friedman et al. (2008).
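To fix ideas, the penalized objective can be evaluated directly for small matrices. The Python sketch below computes log det(T) − tr(ST) − λ Σ|t_ij| using a plain Cholesky factorization; it only evaluates the criterion and is not the GLASSO algorithm. The function names and the 2 × 2 matrices are hypothetical.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                diag = a[i][i] - s
                if diag <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(diag)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def cmle_objective(t, s, lam):
    """log det(T) - tr(S T) - lam * sum_ij |t_ij| for a positive definite T."""
    p = len(t)
    chol = cholesky(t)
    log_det = 2.0 * sum(math.log(chol[i][i]) for i in range(p))
    trace_st = sum(s[i][k] * t[k][i] for i in range(p) for k in range(p))
    penalty = lam * sum(abs(v) for row in t for v in row)
    return log_det - trace_st - penalty


# Hypothetical sample covariance and two candidate precision matrices.
s = [[2.0, 0.5], [0.5, 1.0]]
det_s = 2.0 * 1.0 - 0.5 * 0.5
s_inv = [[1.0 / det_s, -0.5 / det_s], [-0.5 / det_s, 2.0 / det_s]]
identity = [[1.0, 0.0], [0.0, 1.0]]

unpenalized_at_inverse = cmle_objective(s_inv, s, 0.0)
unpenalized_at_identity = cmle_objective(identity, s, 0.0)
```

With λ = 0 the value at the inverse of S exceeds the value at the identity, in line with the remark that the unpenalized criterion is maximized at the inverse of S; a positive λ lowers the criterion in proportion to the sum of the absolute cell values of T.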

Öllerer and Croux (2015) and Tarr et al. (2015) proposed robustifying the
GLASSO estimators by replacing S by robust estimators, thus obtaining a higher 
resistance towards contamination. 


6.15 Mixed linear models 


Mixed linear models (MLM) are linear models in which the expectations of the
observations are linear functions of a parameter vector $\boldsymbol{\beta}$, but unlike the ordinary
linear models, the covariance matrix has a special structure that depends on further
parameters.

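Consider, for example, an ANOVA model with one fixed and one random factor,
in which the observations $y_{ij}$ verify

$$y_{ij} = \beta_j + a_i + u_{ij}, \quad 1 \le i \le n, \quad 1 \le j \le q,$$

where the $\beta_j$s are unknown parameters, the $a_i$s are random variables and $u_{ij}$ is an
error term. The standard assumptions are that all $a_i$s and $u_{ij}$s are independent, with
$a_i \sim N(0, \tau^2)$ and $u_{ij} \sim N(0, \sigma^2)$. Then for $1 \le i \le n$ and $1 \le j \le q$

$$E\,y_{ij} = \beta_j, \quad \mathrm{Var}(y_{ij}) = \tau^2 + \sigma^2, \quad \mathrm{Cov}(y_{ij}, y_{ih}) = \tau^2 \ \text{for } j \ne h.$$

Putting $\mathbf{y}_i = (y_{i1},\dots,y_{iq})'$ and $\boldsymbol{\beta} = (\beta_1,\dots,\beta_q)'$, it follows that

$$\mathbf{y}_i \sim N_q\left( \boldsymbol{\beta},\ \sigma^2 \left( \mathbf{I} + \frac{\tau^2}{\sigma^2} \mathbf{V} \right) \right),$$

where $\mathbf{V}$ is the q × q matrix with all its elements equal to one.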
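The covariance structure of this ANOVA model is easy to write down explicitly. The following Python sketch builds σ²(I + (τ²/σ²)V) for given variance components; the function name and the numerical values are hypothetical.

```python
def anova_covariance(q, sigma2, tau2):
    """Covariance sigma2 * (I + (tau2 / sigma2) * V), with V the all-ones q x q matrix."""
    gamma1 = tau2 / sigma2                 # ratio of variance components
    return [[sigma2 * ((1.0 if j == h else 0.0) + gamma1) for h in range(q)]
            for j in range(q)]


# Hypothetical variance components: error variance 2.0, random-effect variance 0.5.
cov = anova_covariance(q=3, sigma2=2.0, tau2=0.5)
for row in cov:
    print(row)
```

The diagonal entries equal τ² + σ² = 2.5 and the off-diagonal entries equal τ² = 0.5, in agreement with the variances and covariances given above.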

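This situation is a particular case of a class of MLMs that will be dealt with in this
section. The data are $(\mathbf{X}_i, \mathbf{y}_i)$, i = 1, ..., n, where the q × p matrices $\mathbf{X}_i$ can be fixed or
random, and $\mathbf{y}_i \in R^q$ are random vectors. Call $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ the mean vector and covariance
matrix of $\mathbf{y}_i$ in the case $\mathbf{X}_i$ is fixed; if $\mathbf{X}_i$ is random, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the conditional means
and covariances given $\mathbf{X}_i$. It is assumed that $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ depend on unknown parameters
in the following way:

$$\boldsymbol{\mu}(\mathbf{X}, \boldsymbol{\beta}) = \mathbf{X}\boldsymbol{\beta}, \quad \boldsymbol{\Sigma}(\eta, \boldsymbol{\gamma}) = \eta \left( \mathbf{V}_0 + \sum_{j=1}^{J} \gamma_j \mathbf{V}_j \right), \tag{6.99}$$

where $\boldsymbol{\beta} \in R^p$, $\boldsymbol{\gamma} \in R^J$ and η are unknown parameters, with $\gamma_j \ge 0$ and $\eta > 0$; and $\mathbf{V}_j$
(j = 0, ..., J) are known positive-definite q × q matrices.

It is seen that the ANOVA model described above is an instance of (6.99) with
$\mathbf{X}_i = \mathbf{I}_q$, $\eta = \sigma^2$, $\mathbf{V}_0 = \mathbf{I}$, $\mathbf{V}_1 = \mathbf{V}$, J = 1 and $\gamma_1 = \tau^2/\sigma^2$.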

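This class of models includes, among others, ANOVA models with repeated measurements,
models with random nested design and models for longitudinal data. It
contains models of the form

$$\mathbf{y}_i = \mathbf{X}_i \boldsymbol{\beta} + \sum_{j=1}^{J} \mathbf{Z}_j \mathbf{a}_{ij} + \mathbf{e}_i, \quad 1 \le i \le n, \tag{6.100}$$

where the $\mathbf{X}_i$s are as above, $\mathbf{Z}_j$, $1 \le j \le J$, are $p \times q_j$ known design matrices for the
random effects, $\mathbf{a}_{ij}$ are independent $q_j$-dimensional random vectors with distribution
$N_{q_j}(\mathbf{0}, \sigma_j^2 \mathbf{I}_{q_j})$, and $\mathbf{e}_i$ ($1 \le i \le n$) are p-dimensional error vectors with distribution
$N_p(\mathbf{0}, \sigma^2 \mathbf{I}_p)$. We are then in the framework (6.99) with $\eta = \sigma^2$, $\boldsymbol{\gamma} = (\gamma_1,\dots,\gamma_J)'$ with
$\gamma_j = \sigma_j^2/\sigma^2 \ge 0$ and $\mathbf{V}_j = \mathbf{Z}_j \mathbf{Z}_j'$, $1 \le j \le J$.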

An instance of (6.100) is given by longitudinal analysis models, in which, for a random
sample of n individuals, one response and q covariates are measured at fixed times
$t_1,\dots,t_p$. Let $y_{ij}$ be the response for individual i at time $t_j$, and call $x_{ijk}$ the value of
covariate k at time $t_j$ for the individual i. It is assumed that

$$y_{ij} = \sum_{k=1}^{q} \beta_k x_{ijk} + a_i + u_{ij},$$

where $\boldsymbol{\beta} = (\beta_1,\dots,\beta_q)'$ is an unknown vector, $a_i$, the effect of individual i, is a $N(0, \tau^2)$
random variable, and the error $u_{ij}$ is $N(0, \sigma^2)$. It is assumed that all the $u_{ij}$s and $a_i$s
are independent. Let $\mathbf{y}_i = (y_{i1},\dots,y_{ip})'$, $\mathbf{X}_i$ the p × q matrix whose (j, k) element is $x_{ijk}$,
$\mathbf{u}_i = (u_{i1},\dots,u_{ip})'$ and $\mathbf{z}$ the p-dimensional vector with all elements equal to one. Then
we can write

$$\mathbf{y}_i = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{z} a_i + \mathbf{u}_i, \quad 1 \le i \le n,$$

which has the form (6.100).


6.15.1 Robust estimation for MLM 


Consider a sample $(\mathbf{X}_1, \mathbf{y}_1), \dots, (\mathbf{X}_n, \mathbf{y}_n)$ of the MLM. If the $\mathbf{X}_i$ are fixed, it is assumed
that the distribution of $\mathbf{y}_i$ belongs to an elliptical family (6.7) with location vector $\boldsymbol{\mu}$
and scatter matrix $\boldsymbol{\Sigma}$ given by (6.99). If the $\mathbf{X}_i$ are random, the same applies to the
conditional distribution of $\mathbf{y}_i$ given $\mathbf{X}_i$.

Most commonly, $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ and η are estimated by maximum likelihood (ML) or
restricted maximum likelihood (REML), assuming that the elliptical family is the
multivariate normal; see, for example, Searle et al. (1992). However, as can be
expected, these estimators are not robust. Several robust estimators for MLM with
different degrees of generality have been proposed (Fellner, 1986; Richardson and
Welsh, 1995; Stahel and Welsh, 1997). Copt and Victoria-Feser (2006) proposed
an S-estimator similar to the one described in Section 6.4.2 that can be applied
to the model (6.99). The S-estimators are extended to a more general class of
models by Chervoneva and Vishnyakov (2011). While all these proposals are robust
for casewise contamination, they cannot cope with $\mathbf{y}_i$s affected by independent
contamination (Section 6.13) when q is large. Koller (2013) proposed an estimator,
denoted hereafter SMDM, applicable to a very general class of MLMs, obtained by
bounding the terms of the maximum likelihood equations of Henderson et al. (1959).

Agostinelli and Yohai (2016) introduced a new class of robust estimators called
composite τ-estimators for MLM, which are robust under both casewise and cellwise
(or independent) contamination. These estimators are defined similarly to the composite
likelihood estimators proposed by Lindsay (1988). For p-dimensional observations
$\mathbf{y}_i$, the composite likelihood estimators are based on the likelihood of all
subvectors of dimension $p^*$ for some $p^* < p$. The composite τ-estimators are defined
by minimizing a τ-scale (see Section 6.4.5) of the Mahalanobis distances of the
two-dimensional subvectors of the $\mathbf{y}_i$s. Their asymptotic breakdown point is 0.5 for
casewise contamination and 0.29 for independent contamination. S-estimators
and composite τ-estimators will be described below.


6.15.2 Breakdown point of MLM estimators 


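As in Section 6.13, we can consider two types of contaminations for MLM: casewise
and cellwise. Let F be the common distribution of $(\mathbf{X}_i, \mathbf{y}_i)$, $1 \le i \le n$, $\mathbf{x}_{ij}$ the jth row
of $\mathbf{X}_i$ and $y_{ij}$ the jth component of $\mathbf{y}_i$. A casewise contamination of size ε occurs when
$(\mathbf{X}_i, \mathbf{y}_i)$ is replaced by

$$(\mathbf{X}_i^*, \mathbf{y}_i^*) = \delta_i (\mathbf{X}_i, \mathbf{y}_i) + (1 - \delta_i)(\mathbf{Z}_i, \mathbf{w}_i),$$

where the $\delta_i$ are independent Bernoulli random variables with success probability
(1 − ε), and $(\mathbf{Z}_i, \mathbf{w}_i)$, $1 \le i \le n$, are i.i.d. with an arbitrary joint distribution G. Moreover,
it is required that the three sets of variables, $(\mathbf{X}_i, \mathbf{y}_i)_{1 \le i \le n}$, $(\mathbf{Z}_i, \mathbf{w}_i)_{1 \le i \le n}$ and
$(\delta_i)_{1 \le i \le n}$, are independent. We denote by $F_\varepsilon^{case}$ the distribution of $(\mathbf{X}_i^*, \mathbf{y}_i^*)$. Note that
$F_\varepsilon^{case}(\mathbf{X}, \mathbf{y}) = (1 - \varepsilon)F(\mathbf{X}, \mathbf{y}) + \varepsilon G(\mathbf{X}, \mathbf{y})$ is determined by G.

Instead, we have cellwise contamination of size ε when $(\mathbf{x}_{ij}, y_{ij})$ is replaced by

$$(\mathbf{x}_{ij}^*, y_{ij}^*) = \delta_{ij} (\mathbf{x}_{ij}, y_{ij}) + (1 - \delta_{ij})(\mathbf{z}_{ij}, w_{ij}), \quad 1 \le i \le n, \quad 1 \le j \le q,$$

where the $\delta_{ij}$ are independent Bernoulli random variables with success probability
(1 − ε), and $(\mathbf{z}_{ij}, w_{ij})$, $1 \le i \le n$, $1 \le j \le q$, are independent with an arbitrary
joint distribution $G_j$. Once again, the three sets of variables, $(\mathbf{x}_{ij}, y_{ij})_{1 \le i \le n, 1 \le j \le q}$,
$(\mathbf{z}_{ij}, w_{ij})_{1 \le i \le n, 1 \le j \le q}$ and $(\delta_{ij})_{1 \le i \le n, 1 \le j \le q}$, are required to be independent. Let $F_\varepsilon^{cell}$ be
the distribution of $(\mathbf{X}_i^*, \mathbf{y}_i^*)$, where the jth row of $\mathbf{X}_i^*$ is $\mathbf{x}_{ij}^*$ and the jth coordinate of $\mathbf{y}_i^*$
is $y_{ij}^*$. Clearly $F_\varepsilon^{cell}$ depends on $G_1, \dots, G_q$.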

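Let $\boldsymbol{\theta} = \eta\boldsymbol{\gamma}$, and consider an estimator $(\widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{\theta}})$ of $(\boldsymbol{\beta}, \boldsymbol{\theta})$ with asymptotic value
$(\widehat{\boldsymbol{\beta}}_\infty, \widehat{\boldsymbol{\theta}}_\infty)$. The case contamination asymptotic breakdown point of $\widehat{\boldsymbol{\beta}}$ is defined by

$$\varepsilon^*_{case}(\widehat{\boldsymbol{\beta}}_\infty) = \inf\left\{ \varepsilon : \sup_{G} \|\widehat{\boldsymbol{\beta}}_\infty(F_\varepsilon^{case})\| = \infty \right\}. \tag{6.101}$$

Two asymptotic casewise breakdown points of $\widehat{\boldsymbol{\theta}}_\infty$ can be defined: one to infinity,
denoted by $\varepsilon^{+}_{case}(\widehat{\boldsymbol{\theta}}_\infty)$, and the other to 0, denoted by $\varepsilon^{-}_{case}(\widehat{\boldsymbol{\theta}}_\infty)$. They are given by

$$\varepsilon^{+}_{case}(\widehat{\boldsymbol{\theta}}_\infty) = \inf\left\{ \varepsilon : \sup_{G} \|\widehat{\boldsymbol{\theta}}_\infty(F_\varepsilon^{case})\| = \infty \right\} \tag{6.102}$$

$$\varepsilon^{-}_{case}(\widehat{\boldsymbol{\theta}}_\infty) = \inf\left\{ \varepsilon : \inf_{G} \|\widehat{\boldsymbol{\theta}}_\infty(F_\varepsilon^{case})\| = 0 \right\}. \tag{6.103}$$

Three analogous asymptotic breakdown points can be defined for cellwise
contamination:

$$\varepsilon^*_{cell}(\widehat{\boldsymbol{\beta}}_\infty) = \inf\left\{ \varepsilon : \sup_{G_1,\dots,G_q} \|\widehat{\boldsymbol{\beta}}_\infty(F_\varepsilon^{cell})\| = \infty \right\} \tag{6.104}$$

$$\varepsilon^{+}_{cell}(\widehat{\boldsymbol{\theta}}_\infty) = \inf\left\{ \varepsilon : \sup_{G_1,\dots,G_q} \|\widehat{\boldsymbol{\theta}}_\infty(F_\varepsilon^{cell})\| = \infty \right\} \tag{6.105}$$

$$\varepsilon^{-}_{cell}(\widehat{\boldsymbol{\theta}}_\infty) = \inf\left\{ \varepsilon : \inf_{G_1,\dots,G_q} \|\widehat{\boldsymbol{\theta}}_\infty(F_\varepsilon^{cell})\| = 0 \right\}. \tag{6.106}$$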
6.15.3 S-estimators for MLMs


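The class of S-estimators proposed by Copt and Victoria-Feser (2006) for the model
(6.99) are defined as follows. Let S be an M-scale of the form (6.29) with ρ-function
ρ, and call $s_0$ the solution of

$$E\rho\left( \frac{v}{s_0} \right) = \delta, \tag{6.107}$$

where $v \sim \chi^2_q$. Put

$$m_i(\boldsymbol{\beta}, \eta, \boldsymbol{\gamma}) = d(\mathbf{y}_i, \boldsymbol{\mu}(\mathbf{X}_i, \boldsymbol{\beta}), \boldsymbol{\Sigma}(\eta, \boldsymbol{\gamma})),$$

where d denotes the squared Mahalanobis distance (6.4) and $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given by
(6.99).

Then the S-estimator for the MLM is defined by

$$(\widehat{\boldsymbol{\beta}}, \widehat{\eta}, \widehat{\boldsymbol{\gamma}}) = \arg\min \det \boldsymbol{\Sigma}(\eta, \boldsymbol{\gamma})$$

subject to

$$S(m_1(\boldsymbol{\beta}, \eta, \boldsymbol{\gamma}), \dots, m_n(\boldsymbol{\beta}, \eta, \boldsymbol{\gamma})) = s_0.$$

Put

$$\boldsymbol{\Sigma}^*(\boldsymbol{\gamma}) = \frac{\boldsymbol{\Sigma}(\eta, \boldsymbol{\gamma})}{\det(\boldsymbol{\Sigma}(\eta, \boldsymbol{\gamma}))^{1/q}}, \tag{6.108}$$

which depends only on $\boldsymbol{\gamma}$. It is easy to show that the S-estimators for a MLM can be
also defined by

$$(\widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{\gamma}}) = \arg\min S\left\{ d(\mathbf{y}_1, \boldsymbol{\mu}(\mathbf{X}_1, \boldsymbol{\beta}), \boldsymbol{\Sigma}^*(\boldsymbol{\gamma})), \dots, d(\mathbf{y}_n, \boldsymbol{\mu}(\mathbf{X}_n, \boldsymbol{\beta}), \boldsymbol{\Sigma}^*(\boldsymbol{\gamma})) \right\},$$

$$\widehat{\eta} = \frac{1}{s_0} S\left\{ d(\mathbf{y}_1, \boldsymbol{\mu}(\mathbf{X}_1, \widehat{\boldsymbol{\beta}}), \boldsymbol{\Sigma}(1, \widehat{\boldsymbol{\gamma}})), \dots, d(\mathbf{y}_n, \boldsymbol{\mu}(\mathbf{X}_n, \widehat{\boldsymbol{\beta}}), \boldsymbol{\Sigma}(1, \widehat{\boldsymbol{\gamma}})) \right\}.$$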

Copt and Victoria-Feser (2006) showed that the asymptotic breakdown points
of the S-estimator for casewise contamination in the three cases defined in
(6.101)-(6.103) are equal to min(δ, 1 − δ) with δ in (6.107). However, as in the
case of scatter estimation (Section 6.13), the three cellwise breakdown points defined in
(6.104)-(6.106) tend to 0 when q → ∞.


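6.15.4 Composite τ-estimators


Given a vector $\mathbf{a} = (a_1,\dots,a_p)'$, a p × p matrix $\mathbf{A}$ and a couple (j, l) of indices
($1 \le j < l \le p$), put $\mathbf{a}^{jl} = (a_j, a_l)'$ and denote by $\mathbf{A}^{jl}$ the submatrix

$$\mathbf{A}^{jl} = \begin{pmatrix} a_{jj} & a_{jl} \\ a_{lj} & a_{ll} \end{pmatrix}.$$

In a similar way, given a p × k matrix $\mathbf{X}$ we denote by $\mathbf{X}^{jl}$ the matrix of dimension
2 × k built by using rows j and l of $\mathbf{X}$.

Pairwise squared Mahalanobis distances are defined by

$$m_i^{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma}) = d\big(\mathbf{y}_i^{jl}, \boldsymbol{\mu}(\mathbf{X}_i^{jl}, \boldsymbol{\beta}), \boldsymbol{\Sigma}^{*jl}(\boldsymbol{\gamma})\big),$$

with $\boldsymbol{\Sigma}^*$ defined as in (6.108).

Let $\rho_1$ and $\rho_2$ be two bounded and continuously differentiable ρ-functions such
that $2\rho_2(u) - \rho_2'(u)u \ge 0$. Call $s_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})$ the M-scale defined as solution of

$$\frac{1}{n} \sum_{i=1}^{n} \rho_1\left( \frac{m_i^{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})}{s_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})} \right) = \delta \tag{6.109}$$

and $\tau_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})$ the τ-scale (see Section 6.4.5) given by

$$\tau_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma}) = s_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})\, \frac{1}{n} \sum_{i=1}^{n} \rho_2\left( \frac{m_i^{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})}{s_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma})} \right). \tag{6.110}$$

Put

$$L(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{j=1}^{p-1} \sum_{l=j+1}^{p} \tau_{jl}(\boldsymbol{\beta}, \boldsymbol{\gamma}). \tag{6.111}$$

Then the composite τ-estimators of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are defined by

$$(\widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{\gamma}}) = \arg\min_{\boldsymbol{\beta}, \boldsymbol{\gamma}} L(\boldsymbol{\beta}, \boldsymbol{\gamma}), \tag{6.112}$$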


and the estimator 7 of 7 by solving 


n p-l p mB 
ee y(t GB.) )-s (6.113) 


i=1 j=l [=j+1 
where $s_0$ is defined by

$$E\rho_1\left( \frac{v}{s_0} \right) = \delta, \tag{6.114}$$

where $v \sim \chi^2_2$. As a particular case, when $\rho_1 = \rho_2$ we have the class of composite
S-estimators. Agostinelli and Yohai (2016) considered $\rho_k$, k = 1, 2, in the SHR family,
as defined in Section 6.5. These are of the form $\rho_k(d) = \rho_{SHR}(d/c_k)$, where $\rho_{SHR}$ is
defined in (6.48) and $c_k$ are given constants. They found that the choice $c_1 = 1$ and
$c_2 = 1.64$ yields a good trade-off between robustness and efficiency.

It can be shown that the three asymptotic casewise breakdown points defined
in (6.101)-(6.103) of the composite τ-estimator are equal to min(δ, 1 − δ). Moreover,
the three cellwise breakdown points defined in (6.104)-(6.106) are given as
the solution to $1 - (1 - \varepsilon)^2 = \min(\delta, 1 - \delta)$. If δ = 0.5, the three cellwise breakdown
points are equal to 0.29. Looking at Table 6.7, we observe that this value is the same
as the maximum asymptotic breakdown point of equivariant scatter estimators for
two-dimensional data. This result can be explained by the fact that the composite
τ-estimator is based on subvectors of dimension two. These results hold for any value
of q. Agostinelli and Yohai (2016) also showed that under very general conditions the
composite τ-estimators are consistent and asymptotically normal.
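For completeness, the value 0.29 can be verified directly. Solving the equation for the cellwise breakdown point and taking δ = 0.5 gives:

```latex
\begin{align*}
1 - (1 - \varepsilon)^2 &= \min(\delta, 1 - \delta)
  \;\Longrightarrow\;
  \varepsilon = 1 - \sqrt{1 - \min(\delta, 1 - \delta)}, \\
\delta = 0.5:\quad \varepsilon &= 1 - \sqrt{0.5} \approx 0.2929 .
\end{align*}
```

This coincides with the entry for dimension 2 in Table 6.7, since that entry solves $1 - (1 - \varepsilon^*)^2 = 0.5$.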

Computational aspects and algorithms for the composite τ-estimator are
discussed in the Supplementary Material of Agostinelli and Yohai (2016). These
algorithms are implemented in the R package robustvarComp, which is available
in the Comprehensive R Archive Network.

A simulation study in Agostinelli and Yohai (2016) confirms that composite
τ-estimators exhibit good behavior under both types of outlier contamination, while
Copt and Victoria-Feser's S-estimators as well as Koller's SMDM estimator can
only cope with casewise outliers.


Example 6.7 We study the behavior of the different estimators when applied to a
real dataset collected by researchers at the University of Michigan (Anderson
et al., 2009). This is a longitudinal study of 214 children with neural development
disorders. Outliers are present in the couples rather than in the units, and the composite
τ-estimator yields quite different results to the maximum likelihood and S-estimators.
The tables and figures for this example are obtained with script autism.R.


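The children in the study were divided into three diagnostic groups (d) at the age of
two: autism, pervasive developmental disorder and nonspectrum children. An index
measuring the degree of socialization was obtained after collecting information at
the ages (a) of 2, 3, 5, 9, and 13 years. Since not all children were measured at all
ages, 41 children for whom complete data were available were selected. We analyze
these data using a regression model with random coefficients where the socialization
index y is explained by the age a, its square, the factor variable d and the interactions
between a and d, plus an intercept. Let $\mathbf{a} = (a_1, a_2, a_3, a_4, a_5)' = (2, 3, 5, 9, 13)'$,
$d_{(k)i}$, k = 1, 2, 3, $1 \le i \le 41$, the indicator of diagnostic k for child i at age 2, and call
$y_{ij}$ the socialization index of child i at age $a_j$ for $1 \le i \le 41$ and $1 \le j \le 5$. Then it is
assumed that

$$y_{ij} = b_{i1} + b_{i2} a_j + b_{i3} a_j^2 + \beta_4 d_{(1)i} + \beta_5 d_{(2)i} + \beta_6 a_j \times d_{(1)i} + \beta_7 a_j \times d_{(2)i} + \beta_8 a_j^2 \times d_{(1)i} + \beta_9 a_j^2 \times d_{(2)i} + \varepsilon_{ij},$$

for $1 \le i \le 41$, $1 \le j \le 5$, where $(b_{i1}, b_{i2}, b_{i3})$ are i.i.d. random coefficients with mean
$(\beta_1, \beta_2, \beta_3)$ and covariance matrix

$$\begin{pmatrix} \sigma_{11} & \sigma_{1a} & \sigma_{1a^2} \\ \sigma_{1a} & \sigma_{aa} & \sigma_{aa^2} \\ \sigma_{1a^2} & \sigma_{aa^2} & \sigma_{a^2a^2} \end{pmatrix}.$$

Here $\beta_4, \dots, \beta_9$ are fixed coefficients and the $\varepsilon_{ij}$ are i.i.d. random errors, independent
of the random coefficients, with zero mean and variance $\sigma_{\varepsilon\varepsilon}$. Then the model may be
rewritten in terms of (6.99) with p = 5, n = 41, J = 6 and k = 9, $\mathbf{y}_i = (y_{i1},\dots,y_{i5})'$,

$$\mathbf{X}_i = (\mathbf{e}, \mathbf{a}, \mathbf{c}, d_{(1)i}\mathbf{e}, d_{(2)i}\mathbf{e}, d_{(1)i}\mathbf{a}, d_{(2)i}\mathbf{a}, d_{(1)i}\mathbf{c}, d_{(2)i}\mathbf{c}),$$

where $\mathbf{e}$ is a vector of dimension 5 with all the elements equal to one and
$\mathbf{c} = (a_1^2,\dots,a_5^2)'$. The variance and covariance structure $\boldsymbol{\Sigma}(\eta, \boldsymbol{\gamma}) = \eta\big(\mathbf{I} + \sum_{j=1}^{J} \gamma_j \mathbf{V}_j\big)$
has the following components:

$$\mathbf{V}_1 = \mathbf{e}\mathbf{e}', \quad \mathbf{V}_2 = \mathbf{a}\mathbf{a}', \quad \mathbf{V}_3 = \mathbf{c}\mathbf{c}', \quad \mathbf{V}_4 = \mathbf{e}\mathbf{a}' + \mathbf{a}\mathbf{e}', \quad \mathbf{V}_5 = \mathbf{e}\mathbf{c}' + \mathbf{c}\mathbf{e}', \quad \mathbf{V}_6 = \mathbf{a}\mathbf{c}' + \mathbf{c}\mathbf{a}',$$

$$\eta = \sigma_{\varepsilon\varepsilon}, \quad \gamma_1 = \sigma_{11}/\sigma_{\varepsilon\varepsilon}, \quad \gamma_2 = \sigma_{aa}/\sigma_{\varepsilon\varepsilon}, \quad \gamma_3 = \sigma_{a^2a^2}/\sigma_{\varepsilon\varepsilon},$$
$$\gamma_4 = \sigma_{1a}/\sigma_{\varepsilon\varepsilon}, \quad \gamma_5 = \sigma_{1a^2}/\sigma_{\varepsilon\varepsilon}, \quad \gamma_6 = \sigma_{aa^2}/\sigma_{\varepsilon\varepsilon}.$$

The parameters were estimated using the following methods:

• restricted maximum likelihood (ML);

• composite τ;

• the Copt and Victoria-Feser (2006) S-estimator (CVFS), using a Rocke ρ-function
with asymptotic rejection point α = 0.1;

• the SMDM estimator defined in Koller (2013);

• composite τ-estimator with $c_1 = 1$ and $c_2 = 1.64$.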


Table 6.8 shows the estimators and standard deviations for the fixed effects param- 
eters using different methods, while Table 6.9 reports the estimators of the random 
effect terms. ML, S and SMDM behave similarly but differently from the composite t 
method. The main differences are in the estimation of the random effects terms, both 
in size (error variance component) and shape (correlation components). Composite 
T assigns a large part of the total variance to the random components while the other 
methods assign it to the error term. We observe that variances estimated by compos- 
ite t are in general larger than those estimated by the other methods. On the other 


Table 6.8 Autism dataset. Estimators of the fixed effects parameters. The p-values
are reported between brackets


Method        Int.     a        a²       δ(1)     δ(2)     a×δ(1)   a×δ(2)   a²×δ(1)  a²×δ(2)

Max. lik.     12.847   6.851    -0.062   -5.245   -2.154   -6.345   -4.512   0.133    0.236
              [0.000]  [0.000]  [0.579]  [0.041]  [0.325]  [0.000]  [0.000]  [0.446]  [0.121]

Composite τ   12.143   6.308    -0.089   -5.214   -4.209   -5.361   -3.852   0.082    0.061
              [0.000]  [0.000]  [0.329]  [0.000]  [0.012]  [0.000]  [0.001]  [0.578]  [0.677]

S Rocke       10.934   7.162    -0.107   -4.457   -0.108   -5.769   -4.995   0.094    0.419
              [0.000]  [0.001]  [0.666]  [0.049]  [0.957]  [0.002]  [0.000]  [0.688]  [0.011]

SMDM          12.346   6.020    0.001    -5.192   -2.173   -5.190   -3.870   0.046    0.151
              [0.000]  [0.000]  [0.992]  [0.010]  [0.213]  [0.000]  [0.000]  [0.781]  [0.300]
Table 6.9 Autism dataset. Estimates of the random effects parameters


Method               σ_11    σ_aa    σ_a²a²   σ_1a     σ_1a²    σ_aa²    σ_εε

Maximum likelihood   2.643   2.328   0.102    0.775    0.429    -0.038   51.360
Composite τ          9.362   9.670   0.052    -4.019   -0.002   -0.327   5.164
S Rocke              9.467   3.373   0.222    2.170    1.062    -0.349   22.209
SMDM                 5.745   0.092   0.115    0.727    0.813    0.103    25.385


side, the estimates of the error variance obtained with the composite $\tau$-estimator
are smaller than those obtained with the other methods. As a consequence, the inference
based on the composite $\tau$-estimator concludes that the regression coefficients
corresponding to the diagnostic variables $\delta_{(1)}$ and $\delta_{(2)}$ are significant, while for the
other estimators they are not significant.

To better understand the reasons for these discrepancies, we investigate cell, couple
and row outliers. For a given dimension $1 \le q \le p$, we define as $q$-dimensional
outliers those $q$-dimensional observations whose squared Mahalanobis distance is
greater than the $\alpha$-quantile of the $\chi^2_q$ distribution for a given $\alpha$. In
particular we call cell, couple and row outliers respectively the 1-dimensional,
2-dimensional and $p$-dimensional outliers. The composite $\tau$ procedure identifies 33
couple outliers out of 410 couples (8%) at $\alpha = 0.999$. The number of rows with at
least one couple outlier is 12 out of 41; that is, 28% of the rows are contaminated,
and this percentage of contamination is too high for the S and SMDM procedures.


6.16 *Other estimators of location and scatter 


6.16.1 Projection estimators 


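Note that, if $\hat{\boldsymbol\Sigma}$ is the sample covariance matrix of $\mathbf{x}$, and $\hat\sigma(.)$ denotes the sample SD,
then

$$\hat\sigma(\mathbf{a}'\mathbf{x})^2 = \mathbf{a}'\hat{\boldsymbol\Sigma}\mathbf{a} \quad \forall\ \mathbf{a} \in R^p. \tag{6.115}$$

It would be desirable to have a robust $\hat{\boldsymbol\Sigma}$ fulfilling (6.115) when $\hat\sigma$ is a robust
scatter like the MAD. It can be shown that the SD is the only scatter measure
satisfying (6.115), and hence this goal is unattainable. To overcome this difficulty,
scatter P-estimators (Maronna et al., 1992) were proposed as “best” approximations
to (6.115), analogous to the approach in Section 5.11.2. Specifically, a scatter
P-estimator is a matrix $\hat{\boldsymbol\Sigma}$ that satisfies

$$\sup_{\mathbf{a} \ne \mathbf{0}} \left| \log\left( \frac{\hat\sigma(\mathbf{a}'\mathbf{x})^2}{\mathbf{a}'\hat{\boldsymbol\Sigma}\mathbf{a}} \right) \right| = \min. \tag{6.116}$$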
A similar idea for location was proposed by Tyler (1994). If $\hat\mu(.)$ denotes the
sample mean, then the sample mean of $\mathbf{x}$ may be characterized as a vector $\boldsymbol\nu$ satisfying

$$\hat\mu(\mathbf{a}'(\mathbf{x} - \boldsymbol\nu)) = 0 \quad \forall\ \mathbf{a} \in R^p. \tag{6.117}$$

Now let $\hat\mu$ be a robust univariate location statistic. It would be desirable to find $\boldsymbol\nu$
satisfying (6.117); this unfeasible goal is avoided by defining a location P-estimator
as a vector $\boldsymbol\nu$ such that

$$\max_{\|\mathbf{a}\| = 1} \frac{|\hat\mu(\mathbf{a}'(\mathbf{x} - \boldsymbol\nu))|}{\hat\sigma(\mathbf{a}'\mathbf{x})} = \min, \tag{6.118}$$

where $\hat\sigma$ is a robust scatter (the condition $\|\mathbf{a}\| = 1$ is equivalent to $\mathbf{a} \ne \mathbf{0}$). Note that
$\boldsymbol\nu$ is the point minimizing the outlyingness measure (6.49). The estimator with $\hat\mu$ and
$\hat\sigma$ equal to the median and MAD respectively, is called the MP-estimator by Adrover
and Yohai (2002).

It is easy to verify that both location and scatter P-estimators are equivariant. It 
can be shown that their maximum asymptotic biases for the normal model do not 
depend on p. Adrover and Yohai (2002) computed the maximum biases of several 
estimators and concluded that MP has the smallest. Unfortunately, MP is not fast 
enough to be useful as a starting estimator. 
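
The objective in (6.118) can be illustrated with a small Python sketch. It only evaluates the outlyingness of a candidate point, with the median and the normalized MAD as the univariate location and scatter, and it replaces the maximum over all unit directions by a maximum over randomly drawn directions. The function names, the number of directions and the simulated sample are assumptions made for illustration; this is not the MP-estimator algorithm of Adrover and Yohai (2002), which would also have to minimize over the candidate point.

```python
import math
import random
import statistics


def mad(values):
    """Normalized median absolute deviation (consistent with the SD at the normal)."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def outlyingness(point, data, n_dirs=500, seed=0):
    """Approximate max over unit a of |median(a'x) - a'point| / MAD(a'x).

    The exact maximum runs over all directions; here it is approximated
    by n_dirs random directions (an assumption of this sketch).
    """
    rng = random.Random(seed)
    p = len(point)
    worst = 0.0
    for _ in range(n_dirs):
        a = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in a))
        a = [c / norm for c in a]                      # unit direction
        proj = [sum(ai * xi for ai, xi in zip(a, x)) for x in data]
        scale = mad(proj)
        if scale == 0.0:
            continue                                   # degenerate projection, skip
        centre = statistics.median(proj)
        t = abs(sum(ai * vi for ai, vi in zip(a, point)) - centre) / scale
        worst = max(worst, t)
    return worst


# Hypothetical bivariate standard normal sample
rng = random.Random(1)
sample = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
center_score = outlyingness([0.0, 0.0], sample)   # small: the point is central
far_score = outlyingness([10.0, 10.0], sample)    # large: the point is outlying
```

A location P-estimator would be the point with the smallest such score; the cost of the inner maximization is one reason why the text notes that MP is too slow to serve as a starting estimator.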


6.16.2 Constrained M-estimators 


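Kent and Tyler (1996) define robust efficient estimators, called constrained
M-estimators (CM estimators for short), as in Section 5.11.3:

$$(\hat{\boldsymbol\mu}, \hat{\boldsymbol\Sigma}) = \arg\min_{\boldsymbol\mu, \boldsymbol\Sigma} \left\{ \frac{1}{n}\sum_{i=1}^{n} \rho(d_i) + \frac{1}{2}\ln|\boldsymbol\Sigma| \right\},$$

with the constraint

$$\frac{1}{n}\sum_{i=1}^{n} \rho(d_i) \le \varepsilon,$$

where $d_i = (\mathbf{x}_i - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x}_i - \boldsymbol\mu)$, $\rho$ is a bounded $\rho$-function and $\boldsymbol\Sigma$ ranges over the set
of symmetric positive-definite $p \times p$ matrices.
They show the FBP for data in general position to be

$$\varepsilon^* = \frac{1}{n}\min([n\varepsilon], [n(1 - \varepsilon) - p]),$$

and hence the bound (6.27) is attained when $[n\varepsilon] = (n - p)/2$.
These estimators satisfy M-estimating equations (6.11)-(6.12). By a suitable
choice of $\rho$, they can be tuned to attain a desired efficiency.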
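
The FBP expression above is simple arithmetic, and a short sketch makes the trade-off in the choice of $\varepsilon$ visible. The function name `cm_fbp` is hypothetical, and the brackets $[\cdot]$ are assumed here to denote the integer part (floor); exact rational arithmetic is used so that no rounding error affects the integer parts.

```python
import math
from fractions import Fraction


def cm_fbp(n, p, eps):
    """Finite-sample BP (1/n) * min([n*eps], [n*(1 - eps) - p]).

    [.] is taken to be the floor function (an assumption of this sketch).
    """
    eps = Fraction(eps)
    contaminated = math.floor(n * eps)
    clean_margin = math.floor(n * (1 - eps) - p)
    return Fraction(min(contaminated, clean_margin), n)


n, p = 100, 4
low = cm_fbp(n, p, Fraction(3, 10))      # first term binds: 30/100
best = cm_fbp(n, p, Fraction(48, 100))   # [n*eps] = (n - p)/2 = 48
high = cm_fbp(n, p, Fraction(1, 2))      # second term binds: 46/100
```

With $n = 100$ and $p = 4$ the value is largest when $[n\varepsilon] = (n - p)/2 = 48$, in line with the statement above; moving $\varepsilon$ in either direction lowers the breakdown point.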
6.16.3 Multivariate depth


Another approach for location is based on extending the notion of order statistics to
multivariate data, and then defining $\hat{\boldsymbol\mu}$ as a “multivariate median” or, more generally,
a multivariate L-estimator. Among the large amount of literature on the subject, we
note the work of Tukey (1975b), Liu (1990), Zuo and Serfling (2000), and Bai and
He (1999). The maximum BP of this type of estimator is 1/3, which is much lower
than the maximum BP for equivariant estimators given by (6.27); see Donoho and
Gasko (1992) and Chen and Tyler (2002).


6.17 Appendix: proofs and complements 


6.17.1 Why affine equivariance? 


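Let $\mathbf{x}$ have an elliptical density $f(\mathbf{x}, \boldsymbol\mu, \boldsymbol\Sigma)$ of the form (6.7). Here $\boldsymbol\mu$ and $\boldsymbol\Sigma$ are the
distribution parameters. Then if $\mathbf{A}$ is nonsingular, the usual formula for the density
of transformed variables yields that $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ has density

$$f(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}), \boldsymbol\mu, \boldsymbol\Sigma) = f(\mathbf{y}, \mathbf{A}\boldsymbol\mu + \mathbf{b}, \mathbf{A}\boldsymbol\Sigma\mathbf{A}'), \tag{6.119}$$

and hence the location and scatter parameters of $\mathbf{y}$ are $\mathbf{A}\boldsymbol\mu + \mathbf{b}$ and $\mathbf{A}\boldsymbol\Sigma\mathbf{A}'$ respectively.

Denote by $(\hat{\boldsymbol\mu}(X), \hat{\boldsymbol\Sigma}(X))$ the values of the estimators corresponding to a sample
$X = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$. Then it is desirable that the estimators $(\hat{\boldsymbol\mu}(Y), \hat{\boldsymbol\Sigma}(Y))$ corresponding
to $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_n\}$, with $\mathbf{y}_i = \mathbf{A}\mathbf{x}_i + \mathbf{b}$, transform in the same manner as the parameters
do in (6.119); that is:

$$\hat{\boldsymbol\mu}(Y) = \mathbf{A}\hat{\boldsymbol\mu}(X) + \mathbf{b}, \quad \hat{\boldsymbol\Sigma}(Y) = \mathbf{A}\hat{\boldsymbol\Sigma}(X)\mathbf{A}', \tag{6.120}$$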


which corresponds to (6.3). 

Affine equivariance is natural in those situations where it is desirable that the 
result remains essentially unchanged under any nonsingular linear transformation, 
such as linear discriminant analysis, canonical correlations and factor analysis. This 
does not happen in PCA, since it is based on a fixed metric that is invariant only under 
orthogonal transformations. 
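
The equivariance property (6.120) can be checked numerically for the classical estimators. The sketch below uses only the Python standard library and a small made-up bivariate sample; the helper names, the matrix `A` and the shift `b` are illustrative assumptions. It verifies that the sample mean and sample covariance matrix of the transformed data agree with the transformed estimates.

```python
def mean_vec(X):
    """Sample mean of a list of p-dimensional points."""
    n = len(X)
    return [sum(x[j] for x in X) / n for j in range(len(X[0]))]


def cov_mat(X):
    """Sample covariance matrix (divisor n - 1)."""
    n, p = len(X), len(X[0])
    m = mean_vec(X)
    return [[sum((x[j] - m[j]) * (x[k] - m[k]) for x in X) / (n - 1)
             for k in range(p)] for j in range(p)]


def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def affine(X, A, b):
    """Transformed sample y_i = A x_i + b."""
    return [[v + bj for v, bj in zip(mat_vec(A, x), b)] for x in X]


# Made-up data and a nonsingular transformation
X = [[1.0, 2.0], [2.0, 0.5], [4.0, 3.0], [0.0, 1.5], [3.0, 5.0]]
A = [[2.0, 1.0], [0.0, 3.0]]
b = [1.0, -1.0]
Y = affine(X, A, b)

lhs_mean = mean_vec(Y)                                            # mu(Y)
rhs_mean = [v + bj for v, bj in zip(mat_vec(A, mean_vec(X)), b)]  # A mu(X) + b
lhs_cov = cov_mat(Y)                                              # Sigma(Y)
rhs_cov = mat_mul(mat_mul(A, cov_mat(X)), transpose(A))           # A Sigma(X) A'
```

The same check applied to a coordinate-wise median would fail for a general nonsingular `A`, which is why equivariance is a genuine restriction on robust estimators.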


6.17.2 Consistency of equivariant estimators 


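We shall show that affine equivariant estimators are consistent for elliptical
distributions, in the sense that if $\mathbf{x} \sim f(\mathbf{x}, \boldsymbol\mu, \boldsymbol\Sigma)$ and $\hat{\boldsymbol\mu}_\infty$ and $\hat{\boldsymbol\Sigma}_\infty$ denote the asymptotic
values of the estimators, then

$$\hat{\boldsymbol\mu}_\infty = \boldsymbol\mu, \quad \hat{\boldsymbol\Sigma}_\infty = c\boldsymbol\Sigma, \tag{6.121}$$

where $c$ is a constant.
Denote again for simplicity $(\hat{\boldsymbol\mu}_\infty(\mathbf{x}), \hat{\boldsymbol\Sigma}_\infty(\mathbf{x}))$ as the asymptotic values of the
estimators corresponding to the distribution of $\mathbf{x}$. Note that the asymptotic values
share the affine equivariance of the estimators; that is (6.120) holds also for $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$
replaced by $\hat{\boldsymbol\mu}_\infty$ and $\hat{\boldsymbol\Sigma}_\infty$.

We first prove (6.121) for the case $\boldsymbol\mu = \mathbf{0}$, $\boldsymbol\Sigma = \mathbf{I}$. Then the distribution is spherical,
and so $D(\mathbf{T}\mathbf{x}) = D(\mathbf{x})$ for any orthogonal matrix $\mathbf{T}$. In particular, for $\mathbf{T} = -\mathbf{I}$ we have

$$\hat{\boldsymbol\mu}_\infty(\mathbf{T}\mathbf{x}) = \hat{\boldsymbol\mu}_\infty(-\mathbf{x}) = -\hat{\boldsymbol\mu}_\infty(\mathbf{x}) = \hat{\boldsymbol\mu}_\infty(\mathbf{x}),$$

which implies $\hat{\boldsymbol\mu}_\infty = \mathbf{0}$. At the same time we have

$$\hat{\boldsymbol\Sigma}_\infty(\mathbf{T}\mathbf{x}) = \mathbf{T}\hat{\boldsymbol\Sigma}_\infty(\mathbf{x})\mathbf{T}' = \hat{\boldsymbol\Sigma}_\infty(\mathbf{x}) \tag{6.122}$$

for all orthogonal $\mathbf{T}$. Write $\hat{\boldsymbol\Sigma}_\infty = \mathbf{U}\boldsymbol\Lambda\mathbf{U}'$, where $\mathbf{U}$ is orthogonal and
$\boldsymbol\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$. Putting $\mathbf{T} = \mathbf{U}^{-1}$ in (6.122) yields $\boldsymbol\Lambda = \hat{\boldsymbol\Sigma}_\infty(\mathbf{x})$, so that $\hat{\boldsymbol\Sigma}_\infty(\mathbf{x})$ is
diagonal. Now let $\mathbf{T}$ be the transformation that interchanges the first two coordinate
axes. Then $\mathbf{T}\hat{\boldsymbol\Sigma}_\infty(\mathbf{x})\mathbf{T}' = \hat{\boldsymbol\Sigma}_\infty(\mathbf{x})$ implies that $\lambda_1 = \lambda_2$, and the same procedure shows
that $\lambda_1 = \dots = \lambda_p$. Thus $\hat{\boldsymbol\Sigma}_\infty(\mathbf{x})$ is diagonal with all diagonal elements equal; that is,
$\hat{\boldsymbol\Sigma}_\infty(\mathbf{x}) = c\mathbf{I}$.

To complete the proof of (6.121), put $\mathbf{y} = \boldsymbol\mu + \mathbf{A}\mathbf{x}$, where $\mathbf{x}$ is as before and
$\mathbf{A}\mathbf{A}' = \boldsymbol\Sigma$, so that $\mathbf{y}$ has distribution (6.7). Then the equivariance implies that

$$\hat{\boldsymbol\mu}_\infty(\mathbf{y}) = \boldsymbol\mu + \mathbf{A}\hat{\boldsymbol\mu}_\infty(\mathbf{x}) = \boldsymbol\mu, \quad \hat{\boldsymbol\Sigma}_\infty(\mathbf{y}) = \mathbf{A}\hat{\boldsymbol\Sigma}_\infty(\mathbf{x})\mathbf{A}' = c\boldsymbol\Sigma.$$

The same approach can be used to show that the asymptotic covariance matrix
of $\hat{\boldsymbol\mu}$ verifies (6.6), noting that if $\hat{\boldsymbol\mu}$ has asymptotic covariance matrix $\mathbf{V}$, then $\mathbf{A}\hat{\boldsymbol\mu}$ has
asymptotic covariance matrix $\mathbf{A}\mathbf{V}\mathbf{A}'$ (see Section 6.17.7 for details).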


6.17.3 The estimating equations of the MLE 


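We shall prove (6.11)-(6.12). As a generic notation, if $g(\mathbf{T})$ is a function of the $p \times q$
matrix $\mathbf{T} = [t_{ij}]$, then $\partial g/\partial\mathbf{T}$ will denote the $p \times q$ matrix with elements $\partial g/\partial t_{ij}$; a
vector argument corresponds to $q = 1$. It is well known (see Seber, 1984) that

$$\frac{\partial|\mathbf{A}|}{\partial\mathbf{A}} = |\mathbf{A}|\mathbf{A}^{-1}, \tag{6.123}$$

and the reader can easily verify that

$$\frac{\partial\,\mathbf{b}'\mathbf{A}\mathbf{b}}{\partial\mathbf{b}} = (\mathbf{A} + \mathbf{A}')\mathbf{b} \quad \text{and} \quad \frac{\partial\,\mathbf{b}'\mathbf{A}\mathbf{b}}{\partial\mathbf{A}} = \mathbf{b}\mathbf{b}'. \tag{6.124}$$

Put $\mathbf{V} = \boldsymbol\Sigma^{-1}$. Then (6.9) becomes

$$\mathrm{ave}_i(\rho(d_i)) - \ln|\mathbf{V}| = \min, \tag{6.125}$$

with $d_i = (\mathbf{x}_i - \boldsymbol\mu)'\mathbf{V}(\mathbf{x}_i - \boldsymbol\mu)$. It follows from (6.124) that

$$\frac{\partial d_i}{\partial\boldsymbol\mu} = -2\mathbf{V}(\mathbf{x}_i - \boldsymbol\mu) \quad \text{and} \quad \frac{\partial d_i}{\partial\mathbf{V}} = (\mathbf{x}_i - \boldsymbol\mu)(\mathbf{x}_i - \boldsymbol\mu)'. \tag{6.126}$$

Differentiating (6.125) yields

$$2\mathbf{V}\,\mathrm{ave}_i\{W(d_i)(\mathbf{x}_i - \boldsymbol\mu)\} = \mathbf{0} \quad \text{and} \quad \mathrm{ave}_i\{W(d_i)(\mathbf{x}_i - \boldsymbol\mu)(\mathbf{x}_i - \boldsymbol\mu)'\} - \mathbf{V}^{-1} = \mathbf{0},$$

which are equivalent to (6.11)-(6.12).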


6.17.4 Asymptotic BP of monotone M-estimators 
6.17.4.1 Breakdown point of a location estimator with & known 


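It will be shown that the BP of the location estimator given by (6.17) with $\boldsymbol\Sigma$ known
is $\varepsilon^* = 0.5$. It may be supposed, without loss of generality, that $\boldsymbol\Sigma = \mathbf{I}$ (Problem 6.4),
so that $d(\mathbf{x}, \boldsymbol\mu, \boldsymbol\Sigma) = \|\mathbf{x} - \boldsymbol\mu\|^2$. Let $v(d) = \sqrt{d}\,W_1(d)$. It is assumed that for all $d$

$$v(d) \le K = \lim_{s \to \infty} v(s) < \infty. \tag{6.127}$$

For a given $\varepsilon$ and a contaminating sequence $G_m$, call $\boldsymbol\mu_m$ the solution of (6.17)
corresponding to the mixture $(1 - \varepsilon)F + \varepsilon G_m$. Then the scalar product of (6.17) with $\boldsymbol\mu_m$
yields

$$(1 - \varepsilon)E_F\, v(\|\mathbf{x} - \boldsymbol\mu_m\|^2)\frac{(\mathbf{x} - \boldsymbol\mu_m)'\boldsymbol\mu_m}{\|\mathbf{x} - \boldsymbol\mu_m\|\,\|\boldsymbol\mu_m\|} + \varepsilon E_{G_m} v(\|\mathbf{x} - \boldsymbol\mu_m\|^2)\frac{(\mathbf{x} - \boldsymbol\mu_m)'\boldsymbol\mu_m}{\|\mathbf{x} - \boldsymbol\mu_m\|\,\|\boldsymbol\mu_m\|} = 0. \tag{6.128}$$

Assume that $\|\boldsymbol\mu_m\| \to \infty$. Then, since for each $\mathbf{x}$ we have

$$\lim_{m \to \infty} \frac{(\mathbf{x} - \boldsymbol\mu_m)'\boldsymbol\mu_m}{\|\mathbf{x} - \boldsymbol\mu_m\|\,\|\boldsymbol\mu_m\|} = -1,$$

(6.128) yields

$$0 \le (1 - \varepsilon)K(-1) + \varepsilon K,$$

which implies $\varepsilon \ge 0.5$.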

The assumption (6.127) holds in particular if v is monotone. The case with v 
not monotone has the complications already described for univariate location in 
Section 3.2.3. 


6.17.4.2 Breakdown point of a scatter estimator 


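To prove (6.25), we deal only with $\boldsymbol\Sigma$, and hence assume $\boldsymbol\mu$ is known and equal to $\mathbf{0}$.
Thus $\boldsymbol\Sigma_\infty$ is defined by an equation of the form

$$E\,W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})\mathbf{x}\mathbf{x}' = \boldsymbol\Sigma_\infty. \tag{6.129}$$

Let $\alpha = P(\mathbf{x} = \mathbf{0})$. It will first be shown that in order for (6.129) to have a solution,
it is necessary that

$$K(1 - \alpha) \ge p. \tag{6.130}$$

Let $\mathbf{A}$ be any matrix such that $\boldsymbol\Sigma_\infty = \mathbf{A}\mathbf{A}'$. Multiplying (6.129) by $\mathbf{A}^{-1}$ on the left and
by $\mathbf{A}^{-1\prime}$ on the right yields

$$E\,W(\mathbf{y}'\mathbf{y})\mathbf{y}\mathbf{y}' = \mathbf{I}, \tag{6.131}$$

where $\mathbf{y} = \mathbf{A}^{-1}\mathbf{x}$. Taking the trace in (6.131) yields

$$p = E\,W(\|\mathbf{y}\|^2)\|\mathbf{y}\|^2\,\mathrm{I}(\mathbf{y} \ne \mathbf{0}) \le P(\mathbf{y} \ne \mathbf{0}) \sup_d (dW(d)) = K(1 - \alpha), \tag{6.132}$$

which proves (6.130).
Now let $F$ attribute zero mass to the origin, and consider a proportion $\varepsilon$ of
contamination with distribution $G$. Then (6.129) becomes

$$(1 - \varepsilon)E_F W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})\mathbf{x}\mathbf{x}' + \varepsilon E_G W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})\mathbf{x}\mathbf{x}' = \boldsymbol\Sigma_\infty. \tag{6.133}$$

Assume $\varepsilon < \varepsilon^*$. Take $G$ concentrated at $\mathbf{x}_0$:

$$(1 - \varepsilon)E_F W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})\mathbf{x}\mathbf{x}' + \varepsilon W(\mathbf{x}_0'\boldsymbol\Sigma_\infty^{-1}\mathbf{x}_0)\mathbf{x}_0\mathbf{x}_0' = \boldsymbol\Sigma_\infty. \tag{6.134}$$

Put $\mathbf{x}_0 = \mathbf{0}$ first. Then the distribution $(1 - \varepsilon)F + \varepsilon G$ attributes mass $\varepsilon$ to $\mathbf{0}$, and
hence in order for a solution to exist, we must have (6.130); that is, $K(1 - \varepsilon) \ge p$, and
hence $\varepsilon^* \le 1 - p/K$.

Let $\mathbf{x}_0$ now be arbitrary and again let $\mathbf{A}$ be any matrix such that $\boldsymbol\Sigma_\infty = \mathbf{A}\mathbf{A}'$; then

$$(1 - \varepsilon)E_F W(\mathbf{y}'\mathbf{y})\mathbf{y}\mathbf{y}' + \varepsilon W(\mathbf{y}_0'\mathbf{y}_0)\mathbf{y}_0\mathbf{y}_0' = \mathbf{I} \tag{6.135}$$

where $\mathbf{y} = \mathbf{A}^{-1}\mathbf{x}$ and $\mathbf{y}_0 = \mathbf{A}^{-1}\mathbf{x}_0$. Let $\mathbf{a} = \mathbf{y}_0/\|\mathbf{y}_0\|$. Then multiplying in (6.135) by
$\mathbf{a}'$ on the left and by $\mathbf{a}$ on the right yields

$$(1 - \varepsilon)E_F W(\|\mathbf{y}\|^2)(\mathbf{y}'\mathbf{a})^2 + \varepsilon W(\|\mathbf{y}_0\|^2)\|\mathbf{y}_0\|^2 = 1 \ge \varepsilon W(\|\mathbf{y}_0\|^2)\|\mathbf{y}_0\|^2. \tag{6.136}$$

Call $\lambda_1$ and $\lambda_p$ the smallest and largest eigenvalues of $\boldsymbol\Sigma_\infty$. Let $\mathbf{x}_0$ now tend to infinity.
Since $\varepsilon < \varepsilon^*$, the eigenvalues of $\boldsymbol\Sigma_\infty$ are bounded away from zero and infinity. Since

$$\|\mathbf{y}_0\|^2 = \mathbf{x}_0'\boldsymbol\Sigma_\infty^{-1}\mathbf{x}_0 \ge \frac{\|\mathbf{x}_0\|^2}{\lambda_p},$$

it follows that $\mathbf{y}_0$ tends to infinity. Hence the right-hand side of (6.136) tends to $\varepsilon K$,
and this implies $\varepsilon \le 1/K$.

Now let $\varepsilon > \varepsilon^*$. Then either $\lambda_1 \to 0$ or $\lambda_p \to \infty$. Call $\mathbf{a}_1$ and $\mathbf{a}_p$ the unit eigenvectors
corresponding to $\lambda_1$ and $\lambda_p$. Multiplying (6.133) by $\mathbf{a}_p'$ on the left and by $\mathbf{a}_p$ on
the right yields

$$(1 - \varepsilon)E_F W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})(\mathbf{x}'\mathbf{a}_p)^2 + \varepsilon E_G W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})(\mathbf{x}'\mathbf{a}_p)^2 = \lambda_p.$$

Suppose that $\lambda_p \to \infty$. Divide the above expression by $\lambda_p$; recall that the first
expectation is bounded and that

$$\frac{(\mathbf{x}'\mathbf{a}_p)^2}{\lambda_p} \le \mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x}.$$

Then in the limit we have

$$1 = \varepsilon E_G W(\mathbf{x}'\boldsymbol\Sigma_\infty^{-1}\mathbf{x})\frac{(\mathbf{x}'\mathbf{a}_p)^2}{\lambda_p} \le \varepsilon K,$$

and hence $\varepsilon \ge 1/K$.
On the other hand, taking the trace in (6.136) and proceeding as in the proof of
(6.132) yields

$$p = (1 - \varepsilon)E_F W(\|\mathbf{y}\|^2)\|\mathbf{y}\|^2 + \varepsilon E_G W(\|\mathbf{y}\|^2)\|\mathbf{y}\|^2 \ge (1 - \varepsilon)E_F W(\|\mathbf{y}\|^2)\|\mathbf{y}\|^2.$$

Note that

$$\|\mathbf{y}\|^2 \ge \frac{(\mathbf{x}'\mathbf{a}_1)^2}{\lambda_1}.$$

Hence $\lambda_1 \to 0$ implies $\|\mathbf{y}\|^2 \to \infty$ and thus the right-hand side of the equation above
tends to $(1 - \varepsilon)K$, which implies $\varepsilon \ge 1 - p/K$.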
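
Read together, the bounds obtained in this proof ($\varepsilon^* \le 1 - p/K$ and $\varepsilon^* \le 1/K$ on one side, and $\varepsilon \ge 1/K$ or $\varepsilon \ge 1 - p/K$ whenever $\varepsilon > \varepsilon^*$ on the other) give $\varepsilon^* = \min(1/K, 1 - p/K)$, which is presumably the content of (6.25). The following sketch just evaluates that expression; the function name is hypothetical, and exact fractions are used to avoid rounding in the comparison.

```python
from fractions import Fraction


def scatter_bp(K, p):
    """Asymptotic BP min(1/K, 1 - p/K) of a monotone scatter M-estimator.

    K = sup_d d*W(d); by (6.130) a solution requires K >= p.
    """
    K = Fraction(K)
    if K < p:
        raise ValueError("no solution exists when K < p")
    return min(1 / K, 1 - Fraction(p) / K)


p = 4
bp_small_K = scatter_bp(Fraction(9, 2), p)  # 1 - p/K binds: 1/9
bp_balanced = scatter_bp(p + 1, p)          # both terms equal: 1/5
bp_large_K = scatter_bp(10, p)              # 1/K binds: 1/10
```

Since $1/K$ decreases and $1 - p/K$ increases with $K$, the two terms balance at $K = p + 1$, where the breakdown point equals $1/(p + 1)$; this is why monotone M-estimators of scatter lose robustness as the dimension grows.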


6.17.5 The estimating equations for S-estimators 
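We are going to prove (6.31)-(6.32). Put, for simplicity,

$$\mathbf{V} = \boldsymbol\Sigma^{-1} \quad \text{and} \quad d_i = d(\mathbf{x}_i, \boldsymbol\mu, \mathbf{V}) = (\mathbf{x}_i - \boldsymbol\mu)'\mathbf{V}(\mathbf{x}_i - \boldsymbol\mu),$$

and call $\sigma(\boldsymbol\mu, \mathbf{V})$ the solution of

$$\mathrm{ave}_i\left\{\rho\left(\frac{d_i}{\sigma}\right)\right\} = \delta. \tag{6.137}$$

Then (6.28) amounts to minimizing $\sigma(\boldsymbol\mu, \mathbf{V})$ with $|\mathbf{V}| = 1$. Solving this problem by
the method of Lagrange multipliers becomes

$$g(\boldsymbol\mu, \mathbf{V}, \lambda) = \sigma(\boldsymbol\mu, \mathbf{V}) + \lambda(|\mathbf{V}| - 1) = \min.$$

Differentiating $g$ with respect to $\lambda$, $\boldsymbol\mu$ and $\mathbf{V}$, and recalling (6.123), we have
$|\mathbf{V}| = 1$ and

$$\frac{\partial\sigma}{\partial\boldsymbol\mu} = \mathbf{0}, \quad \frac{\partial\sigma}{\partial\mathbf{V}} + \lambda\mathbf{V}^{-1} = \mathbf{0}. \tag{6.138}$$

Differentiating (6.137) and recalling the first equations in (6.138) and in (6.126)
yields

$$\mathrm{ave}_i\left\{W\left(\frac{d_i}{\sigma}\right)\left(\sigma\frac{\partial d_i}{\partial\boldsymbol\mu} - d_i\frac{\partial\sigma}{\partial\boldsymbol\mu}\right)\right\} = -2\sigma\mathbf{V}\,\mathrm{ave}_i\left\{W\left(\frac{d_i}{\sigma}\right)(\mathbf{x}_i - \boldsymbol\mu)\right\} = \mathbf{0},$$

which implies (6.31). Proceeding similarly with respect to $\mathbf{V}$ we have

$$\mathrm{ave}_i\left\{W\left(\frac{d_i}{\sigma}\right)\left(\sigma\frac{\partial d_i}{\partial\mathbf{V}} - d_i\frac{\partial\sigma}{\partial\mathbf{V}}\right)\right\} = \sigma\,\mathrm{ave}_i\left\{W\left(\frac{d_i}{\sigma}\right)(\mathbf{x}_i - \boldsymbol\mu)(\mathbf{x}_i - \boldsymbol\mu)'\right\} + b\mathbf{V}^{-1} = \mathbf{0},$$

with

$$b = \lambda\,\mathrm{ave}_i\left\{W\left(\frac{d_i}{\sigma}\right)d_i\right\},$$

and this implies (6.32) with $c = -b/\sigma$.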


6.17.6 Behavior of S-estimators for high p 


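It will be shown that an S-estimator with continuous $\rho$ becomes increasingly similar
to the classical estimator when $p \to \infty$. For simplicity, this property is proved here
only for normal data. However, it can be proved under more general conditions which
include finite fourth moments.

Because of the equivariance of S-estimators, we may assume that the true parameters
are $\boldsymbol\Sigma = \mathbf{I}$ and $\boldsymbol\mu = \mathbf{0}$. Then, since the estimator is consistent, its asymptotic values
are $\hat{\boldsymbol\mu}_\infty = \mathbf{0}$, $\hat{\boldsymbol\Sigma}_\infty = \mathbf{I}$.

For each $p$, let $d^{(p)} = d(\mathbf{x}, \hat{\boldsymbol\mu}_\infty, \hat{\boldsymbol\Sigma}_\infty)$. Then $d^{(p)} = \|\mathbf{x}\|^2 \sim \chi^2_p$ and hence

$$E\left(\frac{d^{(p)}}{p}\right) = 1, \quad \mathrm{SD}\left(\frac{d^{(p)}}{p}\right) = \sqrt{\frac{2}{p}}, \tag{6.139}$$

which implies that the distribution of $d^{(p)}/p$ is increasingly concentrated around 1,
and $d^{(p)}/p \to 1$ in probability when $p \to \infty$.

Since $\rho$ is continuous, there exists $a > 0$ such that $\rho(a) = 0.5$. Call $\sigma_p$ the scale
corresponding to $d^{(p)}$:

$$0.5 = E\rho\left(\frac{d^{(p)}}{\sigma_p}\right). \tag{6.140}$$

We shall show that

$$\frac{d^{(p)}}{\sigma_p} \to_p a. \tag{6.141}$$

Since $d^{(p)}/p \to_p 1$, we have for any $\varepsilon > 0$

$$\lim_{p \to \infty} E\rho\left(\frac{d^{(p)}}{p/(a(1 + \varepsilon))}\right) = \rho(a(1 + \varepsilon)) > 0.5$$

and

$$\lim_{p \to \infty} E\rho\left(\frac{d^{(p)}}{p/(a(1 - \varepsilon))}\right) = \rho(a(1 - \varepsilon)) < 0.5.$$

Then (6.140) implies that for large enough $p$

$$\frac{p}{a(1 + \varepsilon)} \le \sigma_p \le \frac{p}{a(1 - \varepsilon)},$$

and hence $\lim_{p \to \infty}(a\sigma_p/p) = 1$, which implies

$$\frac{d^{(p)}}{\sigma_p} = a\,\frac{d^{(p)}/p}{a\sigma_p/p} \to_p a$$

as stated. This implies that for large $n$ and $p$ the weights of the observations $W(d_i/\hat\sigma)$
are

$$W\left(\frac{d_i}{\hat\sigma}\right) = W\left(\frac{d(\mathbf{x}_i, \hat{\boldsymbol\mu}, \hat{\boldsymbol\Sigma})}{\hat\sigma}\right) \approx W\left(\frac{d(\mathbf{x}_i, \hat{\boldsymbol\mu}_\infty, \hat{\boldsymbol\Sigma}_\infty)}{\sigma_p}\right) \approx W(a);$$

that is, they are practically constant. Hence $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$, which are weighted means
and covariances, will be very similar to $E\mathbf{x}$ and $\mathrm{Var}(\mathbf{x})$, and hence very efficient for
normal data.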
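
The concentration stated in (6.139) is easy to see by simulation. The sketch below draws standard normal vectors of increasing dimension, computes $d^{(p)}/p = \|\mathbf{x}\|^2/p$, and compares the empirical mean and standard deviation with the theoretical values 1 and $\sqrt{2/p}$. The function name, the dimensions, the number of replications and the seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics


def scaled_distances(p, n_rep=1000, seed=0):
    """Simulate n_rep values of ||x||^2 / p for x ~ N_p(0, I)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)) / p
            for _ in range(n_rep)]


# For each dimension: (empirical mean, empirical SD, theoretical SD sqrt(2/p))
summary = {}
for p in (5, 50, 200):
    values = scaled_distances(p)
    summary[p] = (statistics.mean(values),
                  statistics.stdev(values),
                  math.sqrt(2.0 / p))

for p, (m, s, s_theory) in summary.items():
    print(f"p={p:4d}  mean={m:.3f}  sd={s:.3f}  sqrt(2/p)={s_theory:.3f}")
```

As $p$ grows the simulated values cluster ever more tightly around 1, so any weight function evaluated at $d_i/\hat\sigma$ sees nearly the same argument for every observation, which is the mechanism behind the near-constant weights described above.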


6.17.7 Calculating the asymptotic covariance matrix 
of location M-estimators 


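Recall that the covariance matrix of the classical location estimator is a constant
multiple of the covariance matrix of the observations, since $\mathrm{Var}(\bar{\mathbf{x}}) = n^{-1}\mathrm{Var}(\mathbf{x})$. We shall
show a similar result for M-estimators for elliptically distributed data. Let the
estimators be defined by (6.14)-(6.15). As explained at the end of Section 6.12.2, it can be
shown that if $\mathbf{x}_i$ has an elliptical distribution (6.7), the asymptotic covariance matrix
of $\hat{\boldsymbol\mu}$ has the form (6.6): $\mathbf{V} = v\boldsymbol\Sigma$, where $v$ is a constant that we shall now calculate.

It can be shown that in the elliptical case the asymptotic distribution of $\hat{\boldsymbol\mu}$ is the
same as if $\boldsymbol\Sigma$ were assumed known. In view of the equivariance of the estimator, we
may consider only the case $\boldsymbol\mu = \mathbf{0}$, $\boldsymbol\Sigma = \mathbf{I}$. Then it follows from (6.121) that $\hat{\boldsymbol\Sigma}_\infty = c\mathbf{I}$,
and taking the trace in (6.15) we have that $c$ is the solution of (6.21).

It will be shown that

$$v = \frac{pa}{b^2},$$

where (writing $z = \|\mathbf{x}\|^2$)

$$a = E\,W_1\left(\frac{z}{c}\right)^2 z, \quad b = 2E\,W_1'\left(\frac{z}{c}\right)\frac{z}{c} + pE\,W_1\left(\frac{z}{c}\right). \tag{6.142}$$

We may write (6.14) as

$$\sum_{i=1}^{n} \boldsymbol\Psi(\mathbf{x}_i, \boldsymbol\mu) = \mathbf{0},$$

with

$$\boldsymbol\Psi(\mathbf{x}, \boldsymbol\mu) = W_1\left(\frac{\|\mathbf{x} - \boldsymbol\mu\|^2}{c}\right)(\mathbf{x} - \boldsymbol\mu).$$

It follows from (3.49) that $\mathbf{V} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}'^{-1}$, where

$$\mathbf{A} = E\boldsymbol\Psi(\mathbf{x}, \mathbf{0})\boldsymbol\Psi(\mathbf{x}, \mathbf{0})', \quad \mathbf{B} = E\dot{\boldsymbol\Psi}(\mathbf{x}, \mathbf{0}),$$

where $\dot{\boldsymbol\Psi}$ is the derivative of $\boldsymbol\Psi$ with respect to $\boldsymbol\mu$; that is, the matrix $\dot{\boldsymbol\Psi}$ with elements
$\dot\Psi_{jk} = \partial\Psi_j/\partial\mu_k$. We have

$$\mathbf{A} = E\,W_1\left(\frac{\|\mathbf{x}\|^2}{c}\right)^2\mathbf{x}\mathbf{x}'.$$

Since $D(\mathbf{x})$ is spherical, $\mathbf{A}$ is a multiple of the identity: $\mathbf{A} = t\mathbf{I}$. Taking the trace and
recalling that $\mathrm{tr}(\mathbf{x}\mathbf{x}') = \mathbf{x}'\mathbf{x}$, we have

$$\mathrm{tr}(\mathbf{A}) = E\,W_1\left(\frac{\|\mathbf{x}\|^2}{c}\right)^2\|\mathbf{x}\|^2 = a = tp,$$

and hence $\mathbf{A} = (a/p)\mathbf{I}$.
To calculate $\mathbf{B}$, recall that

$$\frac{\partial\|\mathbf{a}\|^2}{\partial\mathbf{a}} = 2\mathbf{a}, \quad \frac{\partial\mathbf{a}}{\partial\mathbf{a}} = \mathbf{I},$$

and hence

$$\dot{\boldsymbol\Psi}(\mathbf{x}, \boldsymbol\mu) = -\left\{2W_1'\left(\frac{\|\mathbf{x} - \boldsymbol\mu\|^2}{c}\right)\frac{(\mathbf{x} - \boldsymbol\mu)(\mathbf{x} - \boldsymbol\mu)'}{c} + W_1\left(\frac{\|\mathbf{x} - \boldsymbol\mu\|^2}{c}\right)\mathbf{I}\right\}.$$

Then the same reasoning yields $\mathbf{B} = -(b/p)\mathbf{I}$, which implies

$$v = \frac{pa}{b^2}$$

as stated.
To compute $c$ for normal data, note that it depends only on the distribution of
$\|\mathbf{x}\|^2$, which is $\chi^2_p$.
This approach can also be used to calculate the efficiency of location S-estimators.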


6.17.8 The exact fit property 


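Let the dataset $X$ contain $q \ge n - m^*$ points on the hyperplane $H = \{\mathbf{x} : \boldsymbol\beta'\mathbf{x} = \gamma\}$. It
will be shown that $\hat{\boldsymbol\mu}(X) \in H$ and $\hat{\boldsymbol\Sigma}(X)\boldsymbol\beta = \mathbf{0}$.

Without loss of generality we can take $\|\boldsymbol\beta\| = 1$ and $\gamma = 0$. In fact, the equation
defining $H$ does not change if we divide both sides by $\|\boldsymbol\beta\|$, and since

$$\hat{\boldsymbol\mu}(X + \mathbf{a}) = \hat{\boldsymbol\mu}(X) + \mathbf{a}, \quad \hat{\boldsymbol\Sigma}(X + \mathbf{a}) = \hat{\boldsymbol\Sigma}(X),$$

we may replace $\mathbf{x}$ by $\mathbf{x} + \mathbf{a}$ where $\boldsymbol\beta'\mathbf{a} = -\gamma$.

Now $H = \{\mathbf{x} : \boldsymbol\beta'\mathbf{x} = 0\}$ is a subspace. Call $\mathbf{P}$ the matrix corresponding to the
orthogonal projection on the subspace orthogonal to $H$; that is, $\mathbf{P} = \boldsymbol\beta\boldsymbol\beta'$. Define for
$t \in R$

$$\mathbf{y}_i = \mathbf{x}_i + t\boldsymbol\beta\boldsymbol\beta'\mathbf{x}_i = (\mathbf{I} + t\mathbf{P})\mathbf{x}_i.$$

Then $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_n\}$ has at least $q$ elements in common with $X$, since $\mathbf{P}\mathbf{z} = \mathbf{0}$ for
$\mathbf{z} \in H$. Hence by the definition of BP, $\hat{\boldsymbol\mu}(Y)$ remains bounded for all $t$. Since

$$\hat{\boldsymbol\mu}(Y) = \hat{\boldsymbol\mu}(X) + t\boldsymbol\beta\boldsymbol\beta'\hat{\boldsymbol\mu}(X),$$

the left-hand side is a bounded function of $t$, while the right-hand side tends to infinity
with $t$ unless $\boldsymbol\beta'\hat{\boldsymbol\mu}(X) = 0$; that is, $\hat{\boldsymbol\mu}(X) \in H$.
In the same way

$$\hat{\boldsymbol\Sigma}(Y) = (\mathbf{I} + t\mathbf{P})\hat{\boldsymbol\Sigma}(X)(\mathbf{I} + t\mathbf{P}) = \hat{\boldsymbol\Sigma}(X) + t^2(\boldsymbol\beta'\hat{\boldsymbol\Sigma}(X)\boldsymbol\beta)\boldsymbol\beta\boldsymbol\beta' + t(\mathbf{P}\hat{\boldsymbol\Sigma}(X) + \hat{\boldsymbol\Sigma}(X)\mathbf{P})$$

is a bounded function of $t$, which implies that $\hat{\boldsymbol\Sigma}(X)\boldsymbol\beta = \mathbf{0}$.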


6.17.9 Elliptical distributions 


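A random vector $\mathbf{r} \in R^p$ is said to have a spherical distribution if its density $f$ depends
only on $\|\mathbf{r}\|$; that is, it has the form

$$f(\mathbf{r}) = h(\|\mathbf{r}\|) \tag{6.143}$$

for some nonnegative function $h$. It follows that for any orthogonal matrix $\mathbf{T}$:

$$D(\mathbf{T}\mathbf{r}) = D(\mathbf{r}). \tag{6.144}$$

In fact, (6.144) may be taken as the general definition of a spherical distribution,
without requiring the existence of a density. However, we prefer the definition here
for reasons of simplicity.

The random vector $\mathbf{x}$ will be said to have an elliptical distribution if

$$\mathbf{x} = \boldsymbol\mu + \mathbf{A}\mathbf{r} \tag{6.145}$$

where $\boldsymbol\mu \in R^p$, $\mathbf{A} \in R^{p \times p}$ is nonsingular and $\mathbf{r}$ has a spherical distribution. Let
$\boldsymbol\Sigma = \mathbf{A}\mathbf{A}'$.

We shall call $\boldsymbol\mu$ and $\boldsymbol\Sigma$ the location vector and the scatter matrix of $\mathbf{x}$, respectively.
We now state the most relevant properties of elliptical distributions. If $\mathbf{x}$ is given by
(6.145), then:

1. The distribution of $\mathbf{B}\mathbf{x} + \mathbf{c}$ is also elliptical, with location vector $\mathbf{B}\boldsymbol\mu + \mathbf{c}$ and scatter
matrix $\mathbf{B}\boldsymbol\Sigma\mathbf{B}'$.
2. If the mean and variances of $\mathbf{x}$ exist, then

$$E\mathbf{x} = \boldsymbol\mu, \quad \mathrm{Var}(\mathbf{x}) = c\boldsymbol\Sigma,$$

where $c$ is a constant.
3. The density of $\mathbf{x}$ is

$$h((\mathbf{x} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu)).$$

4. The distributions of linear combinations of $\mathbf{x}$ belong to the same location-scale
family; more precisely, for any $\mathbf{a} \in R^p$

$$D(\mathbf{a}'\mathbf{x}) = D(\mathbf{a}'\boldsymbol\mu + \sqrt{\mathbf{a}'\boldsymbol\Sigma\mathbf{a}}\,r_1), \tag{6.146}$$

where $r_1$ is the first coordinate of $\mathbf{r}$.


The proofs of (1) and (3) are immediate. The proof of (2) follows from the fact
that, if the mean and variances of $\mathbf{r}$ exist, then

$$E\mathbf{r} = \mathbf{0}, \quad \mathrm{Var}(\mathbf{r}) = c\mathbf{I}$$

for some constant $c$.


Proof of (4): It will be shown that the distribution of a linear combination of $\mathbf{r}$ does
not depend on its direction. More precisely, for all $\mathbf{a} \in R^p$

$$D(\mathbf{a}'\mathbf{r}) = D(\|\mathbf{a}\|r_1). \tag{6.147}$$

In fact, let $\mathbf{T}$ be an orthogonal matrix with rows $\mathbf{t}_1', \dots, \mathbf{t}_p'$ such that $\mathbf{t}_1 = \mathbf{a}/\|\mathbf{a}\|$.
Then $\mathbf{T}\mathbf{a} = (\|\mathbf{a}\|, 0, 0, \dots, 0)'$.
Then by (6.144):

$$D(\mathbf{a}'\mathbf{r}) = D(\mathbf{a}'\mathbf{T}'\mathbf{r}) = D((\mathbf{T}\mathbf{a})'\mathbf{r}) = D(\|\mathbf{a}\|r_1)$$

as stated; and (6.146) follows from (6.147) and (6.145).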


6.17.10 Consistency of Gnanadesikan-Kettenring correlations 


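Let the random vector $\mathbf{x} = (x_1, \dots, x_p)'$ have an elliptical distribution: that is, $\mathbf{x} = \mathbf{A}\mathbf{z}$
where $\mathbf{z} = (z_1, \dots, z_p)'$ has a spherical distribution. This implies that for any $\mathbf{u} \in R^p$,

$$D(\mathbf{u}'\mathbf{x}) = D(\|\mathbf{b}\|z_1) \quad \text{with } \mathbf{b} = \mathbf{A}'\mathbf{u},$$

and hence

$$\sigma(\mathbf{u}'\mathbf{x}) = \sigma_0\|\mathbf{b}\| \quad \text{with } \sigma_0 = \sigma(z_1). \tag{6.148}$$

Let $U_j = \mathbf{u}_j'\mathbf{x}$ ($j = 1, 2$) be two linear combinations of $\mathbf{x}$. It will be shown that their
robust correlation (6.69) coincides with the ordinary one.
Assume that $\mathbf{z}$ has finite second moments. We may assume that $\mathrm{Var}(\mathbf{z}) = \mathbf{I}$. Then

$$\mathrm{Cov}(U_1, U_2) = \mathbf{b}_1'\mathbf{b}_2, \quad \mathrm{Var}(U_j) = \|\mathbf{b}_j\|^2,$$

where $\mathbf{b}_j = \mathbf{A}'\mathbf{u}_j$, and hence the ordinary correlation is

$$\mathrm{Corr}(U_1, U_2) = \frac{\mathbf{b}_1'\mathbf{b}_2}{\|\mathbf{b}_1\|\,\|\mathbf{b}_2\|}.$$

Put, for brevity, $\sigma_j = \sigma(U_j)$. It follows from (6.148) that

$$\sigma_j = \|\mathbf{b}_j\|\sigma_0, \quad \sigma\left(\frac{U_1}{\sigma_1} \pm \frac{U_2}{\sigma_2}\right) = \left\|\frac{\mathbf{b}_1}{\|\mathbf{b}_1\|} \pm \frac{\mathbf{b}_2}{\|\mathbf{b}_2\|}\right\|,$$

and hence (6.69) yields

$$\mathrm{RCorr}(U_1, U_2) = \frac{\mathbf{b}_1'\mathbf{b}_2}{\|\mathbf{b}_1\|\,\|\mathbf{b}_2\|} = \mathrm{Corr}(U_1, U_2).$$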
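
The identity behind this proof can be checked on data. The sketch below assumes that the robust correlation (6.69) has the Gnanadesikan-Kettenring form $\frac{1}{4}\{\sigma(U_1/\sigma_1 + U_2/\sigma_2)^2 - \sigma(U_1/\sigma_1 - U_2/\sigma_2)^2\}$; that form is not reproduced in this section, so it is an assumption of the example, as are the function names and the data. When the scale $\sigma$ is taken to be the standard deviation, the expression reproduces the Pearson correlation exactly, which is the sample analogue of the result above.

```python
import math
import statistics


def gk_corr(u, v, scale=statistics.pstdev):
    """Pairwise correlation built from a univariate scale statistic.

    Assumed form: 0.25 * (scale(u/su + v/sv)^2 - scale(u/su - v/sv)^2).
    """
    su, sv = scale(u), scale(v)
    plus = [a / su + b / sv for a, b in zip(u, v)]
    minus = [a / su - b / sv for a, b in zip(u, v)]
    return 0.25 * (scale(plus) ** 2 - scale(minus) ** 2)


def pearson(u, v):
    """Ordinary sample correlation coefficient."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Made-up paired observations
u = [1.2, 0.4, -0.7, 2.5, 1.9, -1.3, 0.0, 0.8]
v = [0.9, 0.1, -1.1, 1.7, 2.4, -0.2, -0.5, 1.0]

r_gk = gk_corr(u, v)    # scale = SD: equals the Pearson correlation
r_pearson = pearson(u, v)
```

Passing a robust scale (for example a normalized MAD) as the `scale` argument gives a robust pairwise correlation; the proof above shows that, for elliptical data, its population value still coincides with the ordinary correlation.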


6.17.11 Spherical principal components 


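We may assume, without loss of generality, that $\boldsymbol\mu = \mathbf{0}$. The covariance matrix of
$\mathbf{x}/\|\mathbf{x}\|$ is

$$\mathbf{U} = E\frac{\mathbf{x}\mathbf{x}'}{\|\mathbf{x}\|^2}. \tag{6.149}$$

It will be shown that $\mathbf{U}$ and $\boldsymbol\Sigma$ have the same eigenvectors.

It will be first assumed that $\boldsymbol\Sigma$ is diagonal: $\boldsymbol\Sigma = \mathrm{diag}\{\lambda_1, \dots, \lambda_p\}$, where $\lambda_1, \dots, \lambda_p$
are its eigenvalues. Then the eigenvectors of $\boldsymbol\Sigma$ are the vectors of the canonical basis
$\mathbf{b}_1, \dots, \mathbf{b}_p$ with $b_{jk} = \delta_{jk}$. It will be shown that the $\mathbf{b}_j$s are also the eigenvectors of $\mathbf{U}$;
that is,

$$\mathbf{U}\mathbf{b}_j = \alpha_j\mathbf{b}_j \tag{6.150}$$

for some $\alpha_j$. For a given $j$, put $\mathbf{u} = \mathbf{U}\mathbf{b}_j$. Then we must show that $k \ne j$ implies $u_k = 0$.
In fact, for $k \ne j$,

$$u_k = E\frac{x_j x_k}{\|\mathbf{x}\|^2},$$

where $x_j$ ($j = 1, \dots, p$) are the coordinates of $\mathbf{x}$. The symmetry of the distribution
implies that $D(x_j, x_k) = D(x_j, -x_k)$, which implies

$$u_k = E\frac{x_j(-x_k)}{\|\mathbf{x}\|^2} = -u_k$$

and hence $u_k = 0$. This proves (6.150). It follows from (6.150) that $\mathbf{U}$ is diagonal.
Now let $\boldsymbol\Sigma$ have arbitrary eigenvectors $\mathbf{t}_1, \dots, \mathbf{t}_p$. Call $\lambda_j$ ($j = 1, \dots, p$) its eigenvalues,
and let $\mathbf{T}$ be the orthogonal matrix with columns $\mathbf{t}_1, \dots, \mathbf{t}_p$, so that

$$\boldsymbol\Sigma = \mathbf{T}\boldsymbol\Lambda\mathbf{T}',$$

where $\boldsymbol\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_p\}$. We must show that the eigenvectors of $\mathbf{U}$ in (6.149) are
the $\mathbf{t}_j$s.

Let $\mathbf{z} = \mathbf{T}'\mathbf{x}$. Then $\mathbf{z}$ has an elliptical distribution with location vector $\mathbf{0}$ and scatter
matrix $\boldsymbol\Lambda$. The orthogonality of $\mathbf{T}$ implies that $\|\mathbf{z}\| = \|\mathbf{x}\|$. Let

$$\mathbf{V} = E\frac{\mathbf{z}\mathbf{z}'}{\|\mathbf{z}\|^2} = \mathbf{T}'\mathbf{U}\mathbf{T}. \tag{6.151}$$

It follows from (6.150) that the $\mathbf{b}_j$s are the eigenvectors of $\mathbf{V}$, and hence that $\mathbf{V}$ is
diagonal. Then (6.151) implies that $\mathbf{U} = \mathbf{T}\mathbf{V}\mathbf{T}'$, which implies that the eigenvectors
of $\mathbf{U}$ are the columns of $\mathbf{T}$, which are the eigenvectors of $\boldsymbol\Sigma$. This completes the proof
of the equality of the eigenvectors of $\mathbf{U}$ and $\boldsymbol\Sigma$.

Now let $\sigma(.)$ be a scatter statistic. We shall show that the values of $\sigma(\mathbf{x}'\mathbf{t}_j)^2$ are
proportional to the eigenvalues of $\boldsymbol\Sigma$. In fact, it follows from (6.146) and $\mathbf{t}_j'\boldsymbol\Sigma\mathbf{t}_j = \lambda_j$
that for all $j$,

$$D(\mathbf{t}_j'\mathbf{x}) = D\left(\mathbf{t}_j'\boldsymbol\mu + \sqrt{\lambda_j}\,r_1\right),$$

and hence

$$\sigma(\mathbf{t}_j'\mathbf{x}) = \sigma\left(\sqrt{\lambda_j}\,r_1\right) = \sqrt{\lambda_j}\,d,$$

with $d = \sigma(r_1)$.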
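
A small simulation illustrates (6.149)-(6.150) in the diagonal case. The sketch draws bivariate normal vectors with scatter matrix $\mathrm{diag}(\lambda_1, \lambda_2)$, projects them onto the unit circle, and averages $\mathbf{x}\mathbf{x}'/\|\mathbf{x}\|^2$. The function name, the eigenvalues, the sample size and the seed are illustrative assumptions; normality is used only for convenience, since the argument above needs just the symmetry of the distribution.

```python
import math
import random


def spherical_cov_2d(lam1, lam2, n=20000, seed=0):
    """Monte Carlo estimate of U = E[x x' / ||x||^2] for x ~ N(0, diag(lam1, lam2))."""
    rng = random.Random(seed)
    u11 = u12 = u22 = 0.0
    for _ in range(n):
        x1 = math.sqrt(lam1) * rng.gauss(0.0, 1.0)
        x2 = math.sqrt(lam2) * rng.gauss(0.0, 1.0)
        r2 = x1 * x1 + x2 * x2          # squared norm, positive with probability one
        u11 += x1 * x1 / r2
        u12 += x1 * x2 / r2
        u22 += x2 * x2 / r2
    return [[u11 / n, u12 / n], [u12 / n, u22 / n]]


U = spherical_cov_2d(9.0, 1.0)
off_diagonal = U[0][1]        # close to 0: U is (nearly) diagonal, like Sigma
trace_U = U[0][0] + U[1][1]   # equals 1 because x/||x|| has unit norm
```

The estimated matrix is nearly diagonal, so it shares the canonical eigenvectors of $\boldsymbol\Sigma$, and its larger diagonal entry corresponds to the larger eigenvalue. Its eigenvalues are not proportional to $\lambda_1, \lambda_2$, however, which is why the text recovers the eigenvalues separately through $\sigma(\mathbf{x}'\mathbf{t}_j)^2$.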


6.17.12 Fixed point estimating equations and computing 
algorithm for the GS estimator 


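Let $\mathbf{x} \in R^p$ be a possible observation and $\mathbf{u}$ the vector of zeros and ones indicating
the missing observations (see Section 6.12.1). Call $\mathbf{m} \in R^p$ and $\boldsymbol\Sigma$ the mean vector
and covariance matrix of $\mathbf{x}$. Then we define $\hat{\mathbf{x}}(\mathbf{u}, \mathbf{x}^{(\mathbf{u})}, \mathbf{m}, \boldsymbol\Sigma)$ as the best linear
predictor of $\mathbf{x}$ given $\mathbf{x}^{(\mathbf{u})}$, and $\mathbf{C}(\mathbf{u}, \boldsymbol\Sigma)$ as the covariance matrix for the prediction error
$\mathbf{x} - \hat{\mathbf{x}}(\mathbf{u}, \mathbf{x}^{(\mathbf{u})}, \mathbf{m}, \boldsymbol\Sigma)$. In particular, if $\mathbf{u}$ has the first $q = p(\mathbf{u})$ entries equal to one
and the remaining entries equal to zero, we have the following simple formulae. Let
$\mathbf{v} = (v_1, \dots, v_p)' \in A$, such that $v_1 = \dots = v_q = 0$ and $v_{q+1} = \dots = v_p = 1$ and put

$$\mathbf{m} = \begin{pmatrix} \mathbf{m}^{(\mathbf{u})} \\ \mathbf{m}^{(\mathbf{v})} \end{pmatrix}, \quad \boldsymbol\Sigma = \begin{pmatrix} \boldsymbol\Sigma_{uu} & \boldsymbol\Sigma_{uv} \\ \boldsymbol\Sigma_{vu} & \boldsymbol\Sigma_{vv} \end{pmatrix}.$$

Then,

$$\hat{\mathbf{x}}(\mathbf{u}, \mathbf{x}^{(\mathbf{u})}, \mathbf{m}, \boldsymbol\Sigma) = \begin{pmatrix} \mathbf{x}^{(\mathbf{u})} \\ \mathbf{m}^{(\mathbf{v})} + \boldsymbol\Sigma_{vu}\boldsymbol\Sigma_{uu}^{-1}(\mathbf{x}^{(\mathbf{u})} - \mathbf{m}^{(\mathbf{u})}) \end{pmatrix}, \tag{6.152}$$

$$\mathbf{C}(\mathbf{u}, \boldsymbol\Sigma) = \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \boldsymbol\Sigma_{vv} - \boldsymbol\Sigma_{vu}\boldsymbol\Sigma_{uu}^{-1}\boldsymbol\Sigma_{uv} \end{pmatrix}. \tag{6.153}$$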
Then the fixed-point estimating equations for the GS-estimators m,, &,, and Z,, 
are the following: 


n= oS (6.154) 


and 


= Yi [; (&; — m,,) (& — m,,)! + w;w;C;] 


__ ooo 6.155 
" pan ww; 
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where x = _ messes «i > iis C; =C (u;, 3); WwW; = Ww (u;, ee m,, oi Sn)> 
“ih 
w* = w*(u;,X; ",M,, $5 sith 
1 1 
w(u < mE a2 [=| /p(u) , dx™,m©™, 2M) |x| /p(u) (6.156) 
1G 17pm) Cow S — |Q™ 1 /pH 


d(x, m”, x) 
p(u) 


3, = S*(m,,%,,) and = =%,z,,, where S,, satisfies (6.98). 
These equations show that the GS-estimators of location and scatter shape are a
weighted mean and a weighted/corrected sample covariance matrix. If $w_i = w_i^* = 1$
for $1 \le i \le n$, we obtain the equations of the MLE for Gaussian data.


w*(u,x™, m,Z) = : (6.157) 


6.17.12.1 Computing algorithm 


Based on the fixed points equations, a following natural algorithm can be used to 
compute the GS-estimators. Given initial estimators cm? - 30 ) 3), put G, = = £0 
and define the sequence an“ i: pa . ci », k > 0, using the recursion below. A proce- 
dure to compute the initial estimators a, GO 5 ry ) can be found in Danilov et al. 
(2012). 


Given na” £0 3), compute cal), psa gD) as follows: 


Mm, >4n > 5n 


+ eth) 
i 


n (6.158) 
Dini 
and 
ae yy [w OG = my) — im) + w" igi Oia 
aS Sa (6.159) 
je 1; Ww; 


where = = X(u,, ae ma” Hy C= C(u,, 3), ws ® = = w(u,, x") ih mi” 2, 

a) and w* = i tae ar m” 2), where w and w* are defined in (6.156) and 
(6.157) eapeutively. Set 3 x(t) = Sant”, 7 an d Se) = Genser where 
+ is the solution to (6. 98) with m, = = ath and &,, = x +) The iteration stops 
when |5;, gD) /S;, 3 — 1| < € for some apprapuaely chosen € > 0. 


Note that the recursion equations for the classical EM algorithm for Gaussian
data can be obtained from (6.158) and (6.159) by setting $w_i = w_i^* = 1$ for all $i$.


6.18 Recommendations and software 


For multivariate location and scatter we recommend the S-estimator with Rocke
$\rho$-function (Section 6.4.4) and the MM-estimator with SHR $\rho$-function (Section 6.5),
both starting from the Peña-Prieto estimator (Section 6.9.2). They are implemented
together in covRob (RobStatTM). These functions use as initial estimator the
function initPP (RobStatTM), which implements the Peña-Prieto estimator. As
is explained in Section 6.10.5, we prefer the MM-estimator for p < 10 and the
Rocke S-estimator for p > 10. Both are also available separately in covRobMM and
covRobRocke.

For missing data we recommend the generalized S-estimator with Rocke
$\rho$-function (Section 6.12.2), which is implemented in GSE (GSE).

For data with cell-wise contamination we recommend the two-step generalized
S-estimator (Section 6.13), implemented in TSGS (GSE).

For principal components we recommend the M-S estimator of Section 6.11.2,
implemented in pcaRobS (RobStatTM).

Another option is spherical principal components (Section 6.11.1), implemented
in PcaLocantore (rrcov).

For mixed linear models we recommend the composite $\tau$-estimator (Section
6.15.4). The function varComprob (robustvarComp) computes this and other
robust estimators for these models.


6.19 Problems 


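6.1. Show that if $\mathbf{x}$ has distribution (6.7), then $E\mathbf{x} = \boldsymbol\mu$ and $\mathrm{Var}(\mathbf{x}) = c\boldsymbol\Sigma$.

6.2. Prove that M-estimators (6.14)-(6.15) are affine equivariant.

6.3. Show that the asymptotic value of an equivariant $\hat{\boldsymbol\Sigma}$ for a spherical distribution
is a scalar multiple of $\mathbf{I}$.

6.4. Show that the result of the first part of Section 6.17.4.1 is valid for any $\boldsymbol\Sigma$.

6.5. Prove that if $\rho(t)$ is a bounded nondecreasing function, then $t\rho'(t)$ cannot be
nondecreasing.

6.6. Let $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$ be S-estimators of location and scatter based on the scale $\hat\sigma$, and
let $\hat\sigma_0 = \hat\sigma(d(X, \hat{\boldsymbol\mu}, \hat{\boldsymbol\Sigma}))$. Given a constant $\sigma_0$, define $\hat{\boldsymbol\mu}^*$ and $\hat{\boldsymbol\Sigma}^*$ as the values $\boldsymbol\mu$
and $\boldsymbol\Sigma$ that minimize $|\boldsymbol\Sigma|$ subject to $\hat\sigma(d(\mathbf{x}, \boldsymbol\mu, \boldsymbol\Sigma)) = \sigma_0$. Prove that $\hat{\boldsymbol\mu}^* = \hat{\boldsymbol\mu}$ and
$\hat{\boldsymbol\Sigma}^* = (\hat\sigma_0/\sigma_0)\hat{\boldsymbol\Sigma}$.

6.7. Show that $\bar{\mathbf{x}}$ and $\mathrm{Var}(X)$ are the values of $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$ minimizing $|\boldsymbol\Sigma|$ subject to
$(1/n)\sum_{i=1}^{n} d(\mathbf{x}_i, \boldsymbol\mu, \boldsymbol\Sigma) = p$.

6.8. Let $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$ be the MCD estimators of location and scatter, which minimize
the scale $\hat\sigma(d_1, \dots, d_n) = \sum_{i=1}^{h} d_{(i)}$. For each subsample $A = \{\mathbf{x}_{i_1}, \dots, \mathbf{x}_{i_h}\}$ of size
$h$ call $\bar{\mathbf{x}}_A$ and $\mathbf{C}_A$ the sample mean and covariance matrix corresponding to
$A$. Let $A^*$ be a subsample of size $h$ that minimizes $|\mathbf{C}_A|$. Show that $A^*$ is the
set of observations corresponding to the $h$ smallest values $d(\mathbf{x}_i, \hat{\boldsymbol\mu}, \hat{\boldsymbol\Sigma})$, and that
$\hat{\boldsymbol\mu} = \bar{\mathbf{x}}_{A^*}$ and $\hat{\boldsymbol\Sigma} = |\mathbf{C}_{A^*}|^{-1/p}\,\mathbf{C}_{A^*}$.

6.9. Let $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$ be the MVE estimators of location and scatter. Let $\hat{\boldsymbol\mu}^*$, $\hat{\boldsymbol\Sigma}^*$ be the
values of $\boldsymbol\mu$ and $\boldsymbol\Sigma$ minimizing $|\boldsymbol\Sigma|$ under the constraint that the ellipsoid

$$\{\mathbf{x} \in R^p : (\mathbf{x} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu) \le 1\}$$

of volume $|\boldsymbol\Sigma|$ contains at least $n/2$ sample points. Show that $\hat{\boldsymbol\mu}^* = \hat{\boldsymbol\mu}$ and
$\hat{\boldsymbol\Sigma}^* = \lambda\hat{\boldsymbol\Sigma}$ where $\lambda = \mathrm{Med}\{d(X, \hat{\boldsymbol\mu}, \hat{\boldsymbol\Sigma})\}$.

6.10. Prove (6.30).

6.11. Let $(x, y)$ be bivariate normal with zero means, unit variances and correlation $\rho$,
and let $\psi$ be a monotone $\psi$-function. Show that $E(\psi(x)\psi(y))$ is an increasing
function of $\rho$ (Hint: $y = \rho x + \sqrt{1 - \rho^2}\,z$ with $z \sim \mathrm{N}(0, 1)$ independent of $x$).

6.12. Prove (6.19).

6.13. The dataset glass from (Hettich and Bay, 1999) contains measurements of the
presence of 7 chemical constituents in 76 pieces of glass from nonfloat windows.
Compute the classical and robust estimators of location and scatter and
the respective squared distances. For both, make the QQ plots of distances and
the plots of distances against index numbers, and compare the results.

6.14. The first principal component is often used to represent multispectral images.
The dataset image (Frery, 2005) contains the values corresponding to three
frequency bands for each of the 1573 pixels of a radar image. Compute the
classical and robust principal components and compare the directions of the
respective eigenvectors and the fits given by the first component.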


7 


Generalized Linear Models 


In Chapter 4 we considered regression models where the response variable $y$ depends
linearly on several explanatory variables $x_1, \dots, x_p$. In this case, $y$ was a quantitative
variable; that is, it could take on any real value, and the regressors (which could be
quantitative or qualitative) affected only its mean.

In this chapter we shall consider more general situations in which the regressors
affect the distribution function of $y$. However, to retain parsimony, it is assumed that
this distribution depends on them only through a linear combination $\sum_j \beta_j x_j$ where
the $\beta_j$s are unknown.

We shall first consider the situation when $y$ is a 0-1 variable.


7.1 Binary response regression


Let $y$ be a 0-1 variable representing the death or survival of a patient after heart
surgery. Here $y = 1$ and $y = 0$ represent death and survival, respectively. We want to
predict this outcome by means of different regressors, such as $x_1$ = age, $x_2$ = diastolic
pressure, and so on.

We observe $(\mathbf{x}, y)$ where $\mathbf{x} = (x_1, \dots, x_p)'$ is the vector of explanatory variables.
Assume first that $\mathbf{x}$ is fixed (i.e., nonrandom). To model the dependency of $y$ on $\mathbf{x}$, we
assume that $P(y = 1)$ depends on $\boldsymbol\beta'\mathbf{x}$ for some unknown $\boldsymbol\beta \in R^p$. Since $P(y = 1) \in [0, 1]$
and $\boldsymbol\beta'\mathbf{x}$ may take on any real value, we make the further assumption that

$$P(y = 1) = F(\boldsymbol\beta'\mathbf{x}), \tag{7.1}$$

where $F$ is any continuous distribution function. The function $F^{-1}$ is called the
link function. If instead $\mathbf{x}$ is random, it will be assumed that the probabilities are
conditional; that is,

$$P(y = 1 \mid \mathbf{x}) = F(\boldsymbol\beta'\mathbf{x}). \tag{7.2}$$


Robust Statistics: Theory and Methods (with R), Second Edition. 

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matías Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd. 
Companion website: www.wiley.com/go/maronna/robust

In the common case of a model with an intercept, the first coordinate of each $\mathbf{x}_i$
is one, and the prediction may be written as

$$\boldsymbol{\beta}'\mathbf{x}_i = \beta_0 + \underline{\mathbf{x}}_i'\boldsymbol{\beta}_1, \tag{7.3}$$

with $\underline{\mathbf{x}}_i$ and $\boldsymbol{\beta}_1$ as in (4.6).

The most popular functions $F$ are those corresponding to the logistic distribution

$$F(y) = \frac{e^{y}}{1 + e^{y}} \tag{7.4}$$

(“logistic model”) and to the standard normal distribution $F(y) = \Phi(y)$ (“probit
model”). For the logistic model we have

$$\log \frac{P(y = 1)}{1 - P(y = 1)} = \boldsymbol{\beta}'\mathbf{x}.$$

The left-hand side is called the log odds ratio, and is seen to be a linear function of $\mathbf{x}$.
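Now let $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ be a sample from model (7.1), where $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are
fixed. From now on we shall write for simplicity

$$p_i(\boldsymbol{\beta}) = F(\boldsymbol{\beta}'\mathbf{x}_i).$$

Then $y_1, \ldots, y_n$ are response random variables that take on values 1 and 0 with
probabilities $p_i(\boldsymbol{\beta})$ and $1 - p_i(\boldsymbol{\beta})$ respectively, and hence their frequency function is

$$p(y_i, \boldsymbol{\beta}) = p_i(\boldsymbol{\beta})^{y_i} (1 - p_i(\boldsymbol{\beta}))^{1 - y_i}.$$

Hence the log-likelihood function of the sample, $\log L(\boldsymbol{\beta})$, is given by

$$\log L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i(\boldsymbol{\beta}) + (1 - y_i) \log(1 - p_i(\boldsymbol{\beta})) \right]. \tag{7.5}$$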


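Differentiating (7.5) yields the estimating equations for the maximum likelihood
estimator (MLE):

$$\sum_{i=1}^{n} \frac{y_i - p_i(\boldsymbol{\beta})}{p_i(\boldsymbol{\beta})(1 - p_i(\boldsymbol{\beta}))} F'(\boldsymbol{\beta}'\mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}. \tag{7.6}$$

In the case of random $\mathbf{x}_i$, (7.2) yields

$$\log L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i(\boldsymbol{\beta}) + (1 - y_i) \log(1 - p_i(\boldsymbol{\beta})) \right] + \sum_{i=1}^{n} \log g(\mathbf{x}_i), \tag{7.7}$$

where $g$ is the density of the $\mathbf{x}_i$. Differentiating this log likelihood again yields (7.6).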
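For predicting the values $y_i$ from the corresponding regressor vector $\mathbf{x}_i$, the ideal
situation would be that of “perfect separation”; that is, when there exist $\boldsymbol{\gamma} \in R^p$ and
$\alpha \in R$ such that

$$\boldsymbol{\gamma}'\mathbf{x}_i > \alpha \ \text{ if } y_i = 1, \qquad \boldsymbol{\gamma}'\mathbf{x}_i < \alpha \ \text{ if } y_i = 0, \tag{7.8}$$

and therefore $\boldsymbol{\gamma}'\mathbf{x} = \alpha$ is a separating hyperplane. It is intuitively clear that if one
such hyperplane exists, there must be an infinite number of them. However, this
has the consequence that the MLE becomes undetermined. More precisely, let
$\boldsymbol{\beta}(k) = k\boldsymbol{\gamma}$. Then

$$\lim_{k \to \infty} p_i(\boldsymbol{\beta}(k)) = \lim_{u \to +\infty} F(u) = 1 \ \text{ if } y_i = 1$$

and

$$\lim_{k \to \infty} p_i(\boldsymbol{\beta}(k)) = \lim_{u \to -\infty} F(u) = 0 \ \text{ if } y_i = 0.$$

Therefore

$$\lim_{k \to \infty} \sum_{i=1}^{n} \left[ y_i \log p_i(\boldsymbol{\beta}(k)) + (1 - y_i) \log(1 - p_i(\boldsymbol{\beta}(k))) \right] = 0.$$

Since for all finite $k$

$$\sum_{i=1}^{n} \left[ y_i \log p_i(\boldsymbol{\beta}(k)) + (1 - y_i) \log(1 - p_i(\boldsymbol{\beta}(k))) \right] < 0,$$

according to (7.5)-(7.7), the MLE does not exist for either fixed or random $\mathbf{x}_i$.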
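
The divergence argument can be checked numerically. The following is a minimal Python sketch, assuming a small hypothetical sample that is perfectly separated (the data, the direction `gamma` and the helper names are illustrative and do not come from the text). Along the ray $k\boldsymbol{\gamma}$ the log-likelihood (7.5) keeps increasing towards 0 without reaching it, so no finite maximizer exists.

```python
import math

def log_sigmoid(u):
    """Numerically stable log F(u) for the logistic F(u) = e^u / (1 + e^u)."""
    return -math.log1p(math.exp(-u)) if u >= 0 else u - math.log1p(math.exp(u))

def log_lik(beta, xs, ys):
    """Log-likelihood (7.5) for the logistic model; each x has a leading 1."""
    total = 0.0
    for x, y in zip(xs, ys):
        u = sum(b * xj for b, xj in zip(beta, x))
        # log p_i = log F(u) and log(1 - p_i) = log F(-u)
        total += y * log_sigmoid(u) + (1 - y) * log_sigmoid(-u)
    return total

# Hypothetical sample, perfectly separated at 0.5 (assumption for illustration).
xs = [(1.0, 0.1), (1.0, 0.2), (1.0, 0.4), (1.0, 0.6), (1.0, 0.8), (1.0, 0.9)]
ys = [0, 0, 0, 1, 1, 1]
gamma = (-0.5, 1.0)   # gamma'x > 0 exactly when the second coordinate exceeds 0.5

ks = [1, 10, 100, 1000]
values = [log_lik([k * g for g in gamma], xs, ys) for k in ks]
for k, v in zip(ks, values):
    print(f"k = {k:5d}   log L(k * gamma) = {v:.6g}")
```

Here the intercept is part of $\mathbf{x}$, so the separating hyperplane can be written with $\alpha = 0$.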
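Albert and Anderson (1984) showed that the MLE is unique and finite if and only
if no $\boldsymbol{\gamma} \in R^p$ and $\alpha \in R$ exist such that

$$\boldsymbol{\gamma}'\mathbf{x}_i \ge \alpha \ \text{ if } y_i = 1, \qquad \boldsymbol{\gamma}'\mathbf{x}_i \le \alpha \ \text{ if } y_i = 0.$$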


For $\boldsymbol{\gamma} \in R^p$ and $\alpha \in R$, call $K(\boldsymbol{\gamma}, \alpha)$ the number of points in the sample which do
not satisfy (7.8), and define

$$k_0 = \min_{\boldsymbol{\gamma} \in R^p,\, \alpha \in R} K(\boldsymbol{\gamma}, \alpha), \qquad (\boldsymbol{\gamma}_0, \alpha_0) = \arg\min_{\boldsymbol{\gamma} \in R^p,\, \alpha \in R} K(\boldsymbol{\gamma}, \alpha). \tag{7.9}$$

Then, replacing the $k_0$ points that do not satisfy (7.8) for $\boldsymbol{\gamma} = \boldsymbol{\gamma}_0$ and $\alpha = \alpha_0$ (called
overlapping points) with other $k_0$ points lying on the correct side of the hyperplane
$\boldsymbol{\gamma}_0'\mathbf{x} = \alpha_0$, the MLE goes to infinity. Then we can say that the breakdown point of
the MLE in this case is $k_0/n$. Observe that the points that replace the $k_0$ misclassified
points are not “atypical”. They follow the pattern of the majority: those with
$\boldsymbol{\gamma}_0'\mathbf{x}_i > \alpha_0$ have $y_i = 1$ and those with $\boldsymbol{\gamma}_0'\mathbf{x}_i < \alpha_0$ have $y_i = 0$. The fact that the points
that produce breakdown to infinity are not outliers was observed for the first time by
Croux et al. (2002), who also showed that the effect produced by outliers on the MLE
is quite different; it will be described later in this section.

It is easy to show that the function (7.4) verifies $F'(y) = F(y)(1 - F(y))$. Hence in
the logistic case, (7.6) simplifies to

$$\sum_{i=1}^{n} (y_i - p_i(\boldsymbol{\beta}))\, \mathbf{x}_i = \mathbf{0}. \tag{7.10}$$
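
Equation (7.10) has no closed-form solution, and one standard way to solve it is Newton-Raphson. The sketch below is an illustration under stated assumptions, not the book's scripts or the RobStatTM code: the toy data are hypothetical, and the helper names (`sigmoid`, `solve`, `score`, `logistic_mle`) are made up for this example.

```python
import math

def sigmoid(u):
    """Logistic distribution function F(u) = e^u / (1 + e^u), evaluated stably."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][j] * z[j] for j in range(r + 1, n))) / m[r][r]
    return z

def score(beta, xs, ys):
    """Left-hand side of (7.10): sum_i (y_i - p_i(beta)) x_i."""
    s = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        r = y - sigmoid(sum(b * xj for b, xj in zip(beta, x)))
        for j, xj in enumerate(x):
            s[j] += r * xj
    return s

def logistic_mle(xs, ys, iters=50, tol=1e-10):
    """Newton-Raphson iterations for (7.10), started at beta = 0."""
    p = len(xs[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Information matrix sum_i p_i (1 - p_i) x_i x_i'
        info = [[0.0] * p for _ in range(p)]
        for x in xs:
            pi = sigmoid(sum(b * xj for b, xj in zip(beta, x)))
            w = pi * (1.0 - pi)
            for j in range(p):
                for k in range(p):
                    info[j][k] += w * x[j] * x[k]
        step = solve(info, score(beta, xs, ys))
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta

# Hypothetical overlapping sample (intercept in the first coordinate).
xs = [(1.0, v) for v in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)]
ys = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
beta_hat = logistic_mle(xs, ys)
print("beta_hat =", beta_hat, " score =", score(beta_hat, xs, ys))
```

Plain Newton steps are used for brevity; production code would normally add a safeguard such as step halving, since the iteration is not guaranteed to converge for nearly separated samples.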
We shall henceforth consider only the logistic case, which is probably the most
commonly used, and which, as we shall now see, is easier to robustify.

According to (3.48), the influence function of the MLE for the logistic model is

$$\mathrm{IF}(y, \mathbf{x}, \boldsymbol{\beta}) = \mathbf{M}^{-1} (y - F(\boldsymbol{\beta}'\mathbf{x}))\, \mathbf{x},$$

where $\mathbf{M} = E(F'(\boldsymbol{\beta}'\mathbf{x})\, \mathbf{x}\mathbf{x}')$. Since the factor $(y - F(\boldsymbol{\beta}'\mathbf{x}))$ is bounded, the only
outliers that make this influence large are those such that $\|\mathbf{x}_i\| \to \infty$, $y_i = 1$ and $\boldsymbol{\beta}'\mathbf{x}_i$
is bounded away from $\infty$, or those such that $\|\mathbf{x}_i\| \to \infty$, $y_i = 0$ and $\boldsymbol{\beta}'\mathbf{x}_i$ is bounded
away from $-\infty$. Croux et al. (2002) showed that if the model has an intercept (see
(7.3)), then, unlike the case of ordinary linear regression, this kind of outlier makes
the MLE of $\boldsymbol{\beta}_1$ tend to zero and not to infinity. More precisely, they show that by
conveniently choosing not more than $2(p - 1)$ outliers, the MLE $\hat{\boldsymbol{\beta}}_1$ of $\boldsymbol{\beta}_1$ can be made
as close to zero as desired. This is a situation where, although the estimator remains
bounded, we may say that it breaks down since its values are determined by the
outliers rather than by the bulk of the data, and in this sense the breakdown point to zero
of the MLE is $\le 2(p - 1)/n$.

To exemplify this lack of robustness, we consider a sample of size 100 from the
model

$$\log \frac{P(y = 1)}{1 - P(y = 1)} = \beta_0 + \beta_1 x,$$

where $\beta_0 = -2$, $\beta_1 = 3$ and $x$ is uniform in the interval [0, 1]. Figure 7.1 shows the
sample, and we find, as expected, that for low values of $x$, a majority of the $y$s are zero
and the opposite occurs for large values of $x$. The MLE is $\hat\beta_0 = -1.72$, $\hat\beta_1 = 2.76$.

Figure 7.1 Simulated data: plot of y against x

Figure 7.2 Simulated data: effect of one outlier

Now we add to this sample one outlier of the form $x_0 = i$ and $y_0 = 0$ for
$i = 1, \ldots, 10$, and we plot in Figure 7.2 the values of $\hat\beta_0$ and $\hat\beta_1$. Observe that the value
of $\hat\beta_1$ tends to zero and $\hat\beta_0$ converges to $\log(\alpha/(1 - \alpha))$, where $\alpha = 45/101 \approx 0.45$
is the frequency of ones in the contaminated sample.
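
The experiment can be imitated with a few lines of code. The sketch below is only an approximation of it under assumptions: the sample is drawn with an arbitrary seed, so the numbers will not match the book's figures, and `fit_logistic` is a hypothetical two-parameter Newton solver for (7.10) written for this illustration.

```python
import math
import random

def sigmoid(u):
    """Logistic distribution function, evaluated without overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def fit_logistic(x, y, iters=100, tol=1e-10):
    """MLE of (b0, b1) in logit P(y = 1) = b0 + b1 x via Newton steps on (7.10)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        s0 = s1 = a00 = a01 = a11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            s0 += yi - p                 # score for the intercept
            s1 += (yi - p) * xi          # score for the slope
            a00 += w
            a01 += w * xi
            a11 += w * xi * xi
        det = a00 * a11 - a01 * a01
        d0 = (a11 * s0 - a01 * s1) / det
        d1 = (a00 * s1 - a01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

rng = random.Random(2019)                # arbitrary seed, not the book's sample
x = [rng.random() for _ in range(100)]
y = [1 if rng.random() < sigmoid(-2.0 + 3.0 * xi) else 0 for xi in x]

clean = fit_logistic(x, y)
print("no outlier:       b0 = %.2f  b1 = %.2f" % clean)
fits = {}
for x0 in range(1, 11):
    # One outlier (x0, 0) is added to the original sample at a time.
    fits[x0] = fit_logistic(x + [float(x0)], y + [0])
    print("outlier x0 = %2d:  b0 = %.2f  b1 = %.2f" % ((x0,) + fits[x0]))
```

As the outlier moves to the right, the printed slope should shrink towards zero rather than grow, which is the behaviour described above.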


7.2 Robust estimators for the logistic model 


7.2.1 Weighted MLEs 


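Carroll and Pederson (1993) proposed a simple way to turn the MLE into an estimator
with bounded influence, namely by downweighting high-leverage observations.
A measure of the leverage of observation $\mathbf{x}$ similar to (5.87) is defined as

$$h_n(\mathbf{x}) = \left( (\mathbf{x} - \hat{\boldsymbol{\mu}}_n)' \hat{\boldsymbol{\Sigma}}_n^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}_n) \right)^{1/2},$$

where $\hat{\boldsymbol{\mu}}_n$ and $\hat{\boldsymbol{\Sigma}}_n$ are respectively a robust location vector and a robust scatter matrix
estimator of $\mathbf{x}$. Note that if $\hat{\boldsymbol{\mu}}_n$ and $\hat{\boldsymbol{\Sigma}}_n$ are affine equivariant, this measure is invariant
under affine transformations.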

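Then robust estimators can be obtained by maximizing the weighted log-likelihood

$$\sum_{i=1}^{n} w_i \left[ y_i \log p_i(\boldsymbol{\beta}) + (1 - y_i) \log(1 - p_i(\boldsymbol{\beta})) \right],$$

with

$$w_i = W(h_n(\mathbf{x}_i)), \tag{7.11}$$

where $W$ is a nonincreasing function such that $W(u)u$ is bounded. According to (3.48),
its influence function is

$$\mathrm{IF}(y, \mathbf{x}, \boldsymbol{\beta}) = \mathbf{M}^{-1} (y - F(\boldsymbol{\beta}'\mathbf{x}))\, \mathbf{x}\, W(h(\mathbf{x})),$$

with

$$h(\mathbf{x}) = \left( (\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)^{1/2}, \tag{7.12}$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the limit values of $\hat{\boldsymbol{\mu}}_n$ and $\hat{\boldsymbol{\Sigma}}_n$, and

$$\mathbf{M} = E(W(h(\mathbf{x})) F'(\boldsymbol{\beta}'\mathbf{x})\, \mathbf{x}\mathbf{x}').$$

These estimators are asymptotically normal and their asymptotic covariance matrix
can be found using (3.50).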

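Croux and Haesbroeck (2003) proposed choosing $W$ from the family of “hard
rejection” weight functions, which depends on a parameter $c > 0$:

$$W(u) = \begin{cases} 1 & \text{if } 0 \le u \le c \\ 0 & \text{if } u > c, \end{cases} \tag{7.13}$$

and taking MCD estimators as $\hat{\boldsymbol{\mu}}_n$ and $\hat{\boldsymbol{\Sigma}}_n$.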
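
To make the weighting scheme concrete, here is a minimal sketch for a single regressor plus intercept. It rests on loudly stated assumptions: the MCD location and scatter are replaced by the median and the normalized MAD (a one-dimensional stand-in, not the estimator named above), the cutoff `c = 2.5` and the toy data are hypothetical, and all function names are made up for this example.

```python
import math
import statistics

def sigmoid(u):
    """Logistic distribution function, evaluated without overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def hard_rejection_weights(x, c=2.5):
    """Weights (7.13) applied to a one-dimensional robust leverage.

    Assumption: median and normalized MAD stand in for the MCD estimators.
    """
    med = statistics.median(x)
    madn = statistics.median(abs(v - med) for v in x) / 0.6745
    return [1.0 if abs(v - med) / madn <= c else 0.0 for v in x]

def fit_weighted_logistic(x, y, w, iters=100, tol=1e-10):
    """Solve sum_i w_i (y_i - p_i) (1, x_i)' = 0 by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        s0 = s1 = a00 = a01 = a11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = sigmoid(b0 + b1 * xi)
            v = wi * p * (1.0 - p)
            s0 += wi * (yi - p)
            s1 += wi * (yi - p) * xi
            a00 += v
            a01 += v * xi
            a11 += v * xi * xi
        det = a00 * a11 - a01 * a01
        d0 = (a11 * s0 - a01 * s1) / det
        d1 = (a00 * s1 - a01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Hypothetical data: ten regular points plus one high-leverage outlier (10, 0).
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 10.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

w = hard_rejection_weights(x)
mle = fit_weighted_logistic(x, y, [1.0] * len(x))      # unweighted MLE
wmle = fit_weighted_logistic(x, y, w)                  # weighted MLE
print("weights:", w)
print("MLE  slope = %.3f   WMLE slope = %.3f" % (mle[1], wmle[1]))
```

With hard rejection, the outlier receives weight zero, so the weighted fit coincides with the MLE computed without it.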


7.2.2 Redescending M-estimators 


The MLE for the model can also be defined as minimizing the total deviance

$$D(\boldsymbol{\beta}) = \sum_{i=1}^{n} d^2(p_i(\boldsymbol{\beta}), y_i),$$

where $d(u, y)$ is given by

$$d(u, y) = \left\{ -2 \left[ y \log(u) + (1 - y) \log(1 - u) \right] \right\}^{1/2} \mathrm{sgn}(y - u) \tag{7.14}$$

and is a signed measure of the discrepancy between a Bernoulli variable $y$ and its
expected value $u$. Observe that

$$d(u, y) = \begin{cases} 0 & \text{if } u = y \\ -\infty & \text{if } u = 1,\ y = 0 \\ \infty & \text{if } u = 0,\ y = 1. \end{cases}$$

In the logistic model, the values $d(p_i(\boldsymbol{\beta}), y_i)$ are called deviance residuals, and
they measure the discrepancies between the probabilities fitted using the regression
coefficients $\boldsymbol{\beta}$ and the observed values. In Section 7.3 we define the deviance residuals
for a larger family of models.
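
As a small illustration of (7.14), the following sketch evaluates the deviance residual for a Bernoulli observation. The function name is hypothetical and the boundary cases follow the three limiting values listed above.

```python
import math

def deviance_residual(u, y):
    """Signed deviance residual d(u, y) of (7.14) for y in {0, 1}, 0 <= u <= 1."""
    if u == y:
        return 0.0
    if (u == 1.0 and y == 0) or (u == 0.0 and y == 1):
        return math.copysign(math.inf, y - u)
    # For a 0-1 response only one of the two log terms is active.
    loglik = math.log(u) if y == 1 else math.log(1.0 - u)
    return math.copysign(math.sqrt(-2.0 * loglik), y - u)

for u, y in [(0.5, 1), (0.5, 0), (0.9, 1), (0.9, 0)]:
    print(f"d({u}, {y}) = {deviance_residual(u, y):+.4f}")
```

A fitted probability of 0.9 gives a small residual when $y = 1$ and a large negative one when $y = 0$, which is how the residual flags badly fitted observations.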
Pregibon (1981) proposed robust M-estimators for the logistic model based on
minimizing

$$M(\boldsymbol{\beta}) = \sum_{i=1}^{n} \rho\left( d^2(p_i(\boldsymbol{\beta}), y_i) \right),$$

where $\rho(u)$ is a function that increases more slowly than the identity function.
Bianco and Yohai (1996) observed that for random $\mathbf{x}_i$ these estimators are not
Fisher-consistent; that is, the respective score function does not satisfy (3.32).
They found that this difficulty may be overcome by using a correction term. They
proposed estimating $\boldsymbol{\beta}$ by minimizing

$$M(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ \rho\left( d^2(p_i(\boldsymbol{\beta}), y_i) \right) + q(p_i(\boldsymbol{\beta})) \right], \tag{7.15}$$

where $\rho(u)$ is nondecreasing and bounded and

$$q(u) = v(u) + v(1 - u),$$

with

$$v(u) = 2 \int_0^u \psi(-2 \log t)\, dt$$

and $\psi = \rho'$.

Croux and Haesbroeck (2003) described sufficient conditions on $\rho$ to guarantee a
finite minimum of $M(\boldsymbol{\beta})$ for all samples with overlapping observations ($k_0 > 0$). They
proposed choosing $\psi$ in the family

$$\psi_c^{CH}(u) = \exp\left( -\sqrt{\max(u, c)} \right). \tag{7.16}$$


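Differentiating (7.15) with respect to $\boldsymbol{\beta}$ and using the facts that

$$q'(u) = 2\psi(-2 \log u) - 2\psi(-2 \log(1 - u))$$

and that in the logistic model $F' = F(1 - F)$, we get

$$2 \sum_{i=1}^{n} \psi(d_i^2(\boldsymbol{\beta}))(y_i - p_i(\boldsymbol{\beta}))\, \mathbf{x}_i - 2 \sum_{i=1}^{n} p_i(\boldsymbol{\beta})(1 - p_i(\boldsymbol{\beta})) \left[ \psi(-2 \log p_i(\boldsymbol{\beta})) - \psi(-2 \log(1 - p_i(\boldsymbol{\beta}))) \right] \mathbf{x}_i = \mathbf{0},$$

where $d_i(\boldsymbol{\beta}) = d(p_i(\boldsymbol{\beta}), y_i)$ are the deviance residuals given in (7.14). This equation
can also be written as

$$\sum_{i=1}^{n} \left[ \psi(d_i^2(\boldsymbol{\beta}))(y_i - p_i(\boldsymbol{\beta})) - E_{\boldsymbol{\beta}}\left( \psi(d_i^2(\boldsymbol{\beta}))(y_i - p_i(\boldsymbol{\beta})) \mid \mathbf{x}_i \right) \right] \mathbf{x}_i = \mathbf{0}, \tag{7.17}$$

where $E_{\boldsymbol{\beta}}$ denotes the expectation when $P(y_i = 1 \mid \mathbf{x}_i) = p_i(\boldsymbol{\beta})$. Putting

$$\boldsymbol{\Psi}(y_i, \mathbf{x}_i, \boldsymbol{\beta}) = \left[ \psi(d_i^2(\boldsymbol{\beta}))(y_i - p_i(\boldsymbol{\beta})) - E_{\boldsymbol{\beta}}\left( \psi(d_i^2(\boldsymbol{\beta}))(y_i - p_i(\boldsymbol{\beta})) \mid \mathbf{x}_i \right) \right] \mathbf{x}_i, \tag{7.18}$$

equation (7.17) can also be written as

$$\sum_{i=1}^{n} \boldsymbol{\Psi}(y_i, \mathbf{x}_i, \boldsymbol{\beta}) = \mathbf{0}.$$

From (7.18) it is clear that $E_{\boldsymbol{\beta}}(\boldsymbol{\Psi}(y_i, \mathbf{x}_i, \boldsymbol{\beta})) = \mathbf{0}$, and therefore these estimators
are Fisher-consistent. Their influence function can again be obtained from (3.48).
Bianco and Yohai (1996) proved that under general conditions these estimators
are asymptotically normal. The asymptotic covariance matrix can be obtained
from (3.50).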

In Figure 7.3 we repeat the same graph as in Figure 7.2 using both estimators:
the MLE and a redescending M-estimator with $\psi$ in the Croux and Haesbroeck family (7.16).
We observe that the changes in both the slope and intercept of the M-estimator are
very small compared to those of the MLE.

Figure 7.3 Effect of an outlier on the M-estimator of slope and intercept

Since the function $\boldsymbol{\Psi}(y_i, \mathbf{x}_i, \boldsymbol{\beta})$ is not bounded, the M-estimator does not
have bounded influence. To obtain bounded influence estimators, Croux and
Haesbroeck (2003) propose to downweight high-leverage observations. They define
redescending weighted M- (WM-) estimators by minimizing

$$M(\boldsymbol{\beta}) = \sum_{i=1}^{n} w_i \left[ \rho\left( d^2(p_i(\boldsymbol{\beta}), y_i) \right) + q(p_i(\boldsymbol{\beta})) \right], \tag{7.19}$$

where the weights $w_i$ are as in Section 7.2.1.


Example 7.1 The following dataset was considered by Cook and Weisberg (1982, 
Ch. 5, p. 193) and Johnson (1985) to illustrate the identification of influential obser- 
vations. The data are records for 33 leukemia patients. Tables and figures for this 
example are obtained with script leukemia.R. 


The response variable is 1 when the patient survives more than 52 weeks. Two 
covariates are considered: white blood cell count (WBC) and presence or absence of a 
certain morphological characteristic in the white cells (AG). The model also includes 
an intercept. 

Cook and Weisberg detected an observation (#15) corresponding to a patient with 
WBC = 100,000, who survived for a long time. This observation was very influential
on the MLE. They also noticed that after removing this observation a much better 
overall fit was obtained, and that the fitted survival probabilities of those observations 
corresponding to patients with small values of WBC increased significantly. 

In Table 7.1 we give the estimated slopes and their asymptotic standard deviations
corresponding to:

• the MLE with the complete sample (MLE);
• the MLE after removing the influential observation (MLE$_{-15}$);
• the weighted MLE (WMLE);
• the redescending M-estimator (M) corresponding to the Croux and Haesbroeck
  family $\psi_c^{CH}$ with $c = 0.5$;
• the redescending weighted M-estimator (WM).


Table 7.1 Estimators for leukemia data and their standard errors

Estimator     Intercept        WBC (×10^-4)     AG

MLE           −1.31 (0.81)     −0.32 (0.18)     2.26 (0.95)
MLE_{-15}      0.21 (1.08)     −2.35 (1.35)     2.56 (1.23)
WMLE           0.17 (1.08)     −2.25 (1.32)     2.52 (1.22)
M              0.16 (1.66)     −1.77 (2.33)     1.93 (1.16)
WM             0.20 (1.19)     −2.21 (0.98)     2.40 (1.30)
CUBIF         −1.04 (0.85)     −0.53 (0.30)     2.22 (0.98)

Figure 7.4 Leukemia data: ordered absolute deviances from the ML and WM-estimators


We also show the results for the optimal conditionally unbiased bounded influence 
estimator (CUBIF), which will be described for a more general family of models in 
Section 7.3. 

We can observe that the coefficients fitted with MLE$_{-15}$ are very similar to those of
the WMLE, M- and WM-estimators. The CUBIF estimator gives results intermediate
between MLE and MLE$_{-15}$.

Figure 7.4 compares the ordered absolute deviances of the ML and WM-estimates.
It is seen that the latter gives a better fit than the former, except for
observation 15, which WM clearly pinpoints as atypical.


Example 7.2. The following dataset was introduced by Finney (1947) and later 
studied by Pregibon (1982) and Croux and Haesbroeck (2003). The response is the 
presence or absence of vasoconstriction of the skin of the digits after air inspiration, 
and the explanatory variables are the logarithms of the volume of air inspired (log 
VOL) and of the inspiration rate (log RATE). Tables and figures for this example are 
obtained with script skin.R. 


Table 7.2 gives the estimated coefficients and standard errors for the MLE, WMLE,
M-, WM- and CUBIF estimators. Since there are no outliers in the regressors, the
weighted versions give similar results to the unweighted ones. This also explains in
part why the CUBIF estimator is very similar to the MLE.

Table 7.2 Estimates for skin data

Estimator     Intercept          log VOL         log RATE

MLE            −9.53 (3.21)      3.88 (1.42)     2.65 (0.91)
WMLE           −9.51 (3.18)      3.87 (1.41)     2.64 (0.90)
M             −14.21 (10.88)     5.82 (4.52)     3.72 (2.70)
WM            −14.21 (10.88)     5.82 (4.53)     3.72 (2.70)
CUBIF          −9.47 (3.22)      3.85 (1.42)     2.63 (0.91)

Figure 7.5 Skin data: ordered absolute deviances from the ML and WM-estimators


Figure 7.5 compares the ordered absolute deviances of the ML and 
WM-estimators. It is seen that the latter gives a better fit than the former for 
the majority of observations, and that it more clearly pinpoints observations 4 and 
18 as outliers. 


7.3 Generalized linear models 


The binary response regression model is included in a more general class called
generalized linear models (GLMs). If $\mathbf{x}$ is fixed, the distribution of $y$ is given by a density
$f(y, \lambda)$ depending on a parameter $\lambda$, and there is a known one-to-one function $l$, called
the link function, and an unknown vector $\boldsymbol{\beta}$, such that $l(\lambda) = \boldsymbol{\beta}'\mathbf{x}$. If $\mathbf{x}$ is random, the
conditional distribution of $y$ given $\mathbf{x}$ is given by $f(y, \lambda)$ with $l(\lambda) = \boldsymbol{\beta}'\mathbf{x}$. In the previous
section we had $\lambda \in (0, 1)$ and

$$f(y, \lambda) = \lambda^{y} (1 - \lambda)^{1 - y}, \tag{7.20}$$

and $l = F^{-1}$.

A convenient framework is the exponential family of distributions:

$$f(y, \lambda) = \exp[m(\lambda) y - G(m(\lambda)) - t(y)], \tag{7.21}$$

where $m$, $G$ and $t$ are given functions. When $l = m$, the link function is called
canonical.
It is easy to show that if $y$ has distribution (7.21), then

$$E_\lambda(y) = g(m(\lambda)),$$

where $g = G'$.

This family contains the Bernoulli distribution (7.20), which corresponds to

$$m(\lambda) = \log \frac{\lambda}{1 - \lambda}, \quad G(u) = \log(1 + e^{u}) \quad \text{and} \quad t(y) = 0.$$

In this case, the canonical link corresponds to the logistic model and $E_\lambda(y) = \lambda$.
Another example is the Poisson family with

$$f(y, \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}.$$

This family corresponds to (7.21), with

$$m(\lambda) = \log \lambda, \quad G(u) = e^{u} \quad \text{and} \quad t(y) = \log y!$$

This yields $E_\lambda y = g(m(\lambda)) = \lambda$. The canonical link in this case is $l(\lambda) = \log \lambda$.
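Define $\hat\lambda(y)$ as the value of $\lambda$ that maximizes $f(y, \lambda)$ or, equivalently, that
maximizes

$$\log f(y, \lambda) = m(\lambda) y - G(m(\lambda)) - t(y).$$

Differentiating, we obtain that $\lambda = \hat\lambda(y)$ should satisfy

$$m'(\lambda) y - g(m(\lambda)) m'(\lambda) = 0$$

and therefore $g(m(\hat\lambda(y))) = y$. Define the deviance residual function by

$$d(y, \lambda) = \left\{ 2 \log\left[ f(y, \hat\lambda(y)) / f(y, \lambda) \right] \right\}^{1/2} \mathrm{sgn}(y - g(m(\lambda)))$$
$$\phantom{d(y, \lambda)} = \left\{ 2 \left[ m(\hat\lambda(y)) y - G(m(\hat\lambda(y))) - m(\lambda) y + G(m(\lambda)) \right] \right\}^{1/2} \mathrm{sgn}(y - g(m(\lambda))).$$

It is easy to check that when $y$ is a Bernoulli variable, this definition coincides
with (7.14).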
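
For the Poisson family the general definition can be written out explicitly: with $m(\lambda) = \log\lambda$, $G(u) = e^u$ and $\hat\lambda(y) = y$, the squared residual becomes $2[y \log(y/\lambda) - (y - \lambda)]$. The sketch below evaluates it; the function name is hypothetical, and the convention $0 \log 0 = 0$ for $y = 0$ is an assumption made here to handle the boundary case.

```python
import math

def poisson_deviance_residual(y, lam):
    """Deviance residual d(y, lambda) for the Poisson family (lam > 0, y >= 0)."""
    # d^2 = 2 [y log(y / lam) - (y - lam)], with 0 log 0 taken as 0 (assumption).
    term = y * math.log(y / lam) if y > 0 else 0.0
    d2 = 2.0 * (term - (y - lam))
    # max() guards against tiny negative values caused by rounding.
    return math.copysign(math.sqrt(max(d2, 0.0)), y - lam)

for y in (0, 1, 2, 5, 10):
    print(f"y = {y:2d}, lambda = 2:  d = {poisson_deviance_residual(y, 2.0):+.4f}")
```

The residual is zero when the observation equals its fitted mean and grows in absolute value as the two move apart, with the sign showing the direction.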




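Consider now a sample $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ from a GLM with the canonical link
function and fixed $\mathbf{x}_i$. Then the log likelihood is

$$\log L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \log f(y_i, m^{-1}(\boldsymbol{\beta}'\mathbf{x}_i)) = \sum_{i=1}^{n} (\boldsymbol{\beta}'\mathbf{x}_i) y_i - \sum_{i=1}^{n} G(\boldsymbol{\beta}'\mathbf{x}_i) - \sum_{i=1}^{n} t(y_i). \tag{7.22}$$

The MLE maximizes $\log L(\boldsymbol{\beta})$ or equivalently

$$\sum_{i=1}^{n} 2 \left( \log f(y_i, m^{-1}(\boldsymbol{\beta}'\mathbf{x}_i)) - \log f(y_i, \hat\lambda(y_i)) \right) = -\sum_{i=1}^{n} d^2(y_i, m^{-1}(\boldsymbol{\beta}'\mathbf{x}_i)).$$

Differentiating (7.22) we get the equations for the MLE:

$$\sum_{i=1}^{n} (y_i - g(\boldsymbol{\beta}'\mathbf{x}_i))\, \mathbf{x}_i = \mathbf{0}. \tag{7.23}$$

For example, for the Poisson family this equation is

$$\sum_{i=1}^{n} (y_i - e^{\boldsymbol{\beta}'\mathbf{x}_i})\, \mathbf{x}_i = \mathbf{0}. \tag{7.24}$$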


7.3.1 Conditionally unbiased bounded influence estimators 
To robustify the MLE, Künsch et al. (1989) consider M-estimators of the form

$$\sum_{i=1}^{n} \boldsymbol{\Psi}(y_i, \mathbf{x}_i, \boldsymbol{\beta}) = \mathbf{0},$$

where $\boldsymbol{\Psi} : R^{1+p+p} \to R^p$ is such that

$$E(\boldsymbol{\Psi}(y, \mathbf{x}, \boldsymbol{\beta}) \mid \mathbf{x}) = \mathbf{0}. \tag{7.25}$$

These estimators are referred to as conditionally unbiased bounded influence
(CUBIF). Clearly these estimators are Fisher-consistent; that is, $E(\boldsymbol{\Psi}(y_i, \mathbf{x}_i, \boldsymbol{\beta})) = \mathbf{0}$.
Künsch et al. (1989) found the estimator in this class that solves an optimization
problem similar to Hampel’s one, as studied in Section 3.5.4. This estimator
minimizes a measure of efficiency (based on the asymptotic covariance matrix
under the model) subject to a bound on a measure of infinitesimal sensitivity similar
to the gross-error sensitivity. Since these measures are quite complicated and may
be controversial, we do not give more details about their definition.

The optimal score function has the following form:

$$\boldsymbol{\Psi}(y, \mathbf{x}, \boldsymbol{\beta}, b, \mathbf{B}) = W(\boldsymbol{\beta}, y, \mathbf{x}, b, \mathbf{B}) \left\{ y - g(\boldsymbol{\beta}'\mathbf{x}) - c\left( \boldsymbol{\beta}'\mathbf{x}, \frac{b}{h(\mathbf{x}, \mathbf{B})} \right) \right\} \mathbf{x},$$

where $b$ is the bound on the measure of infinitesimal sensitivity, $\mathbf{B}$ is a dispersion
matrix that will be defined below, and $h(\mathbf{x}, \mathbf{B}) = (\mathbf{x}'\mathbf{B}^{-1}\mathbf{x})^{1/2}$ is a leverage measure.
Observe the similarity with (7.23). The function $W$ downweights atypical observations
and makes $\boldsymbol{\Psi}$ bounded, and therefore the corresponding M-estimator has
bounded influence. The function $c(\boldsymbol{\beta}'\mathbf{x}, b/h(\mathbf{x}, \mathbf{B}))$ is a bias-correction term chosen
so that (7.25) holds. Call $r(y, \mathbf{x}, \boldsymbol{\beta}, b, \mathbf{B})$ the corrected residual:

$$r(y, \mathbf{x}, \boldsymbol{\beta}, b, \mathbf{B}) = y - g(\boldsymbol{\beta}'\mathbf{x}) - c\left( \boldsymbol{\beta}'\mathbf{x}, \frac{b}{h(\mathbf{x}, \mathbf{B})} \right). \tag{7.26}$$

Then the weights are of the form

$$W(\boldsymbol{\beta}, y, \mathbf{x}, b, \mathbf{B}) = W_b\left( r(y, \mathbf{x}, \boldsymbol{\beta}, b, \mathbf{B})\, h(\mathbf{x}, \mathbf{B}) \right),$$

where $W_b$ is the Huber weight function (2.33) given by

$$W_b(x) = \min\left\{ 1, \frac{b}{|x|} \right\}.$$

Then, as in the Schweppe-type GM-estimators of Section 5.11.1, $W$ downweights
observations for which the product of corrected residuals and leverage has a
high value.
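
The Huber weight is simple enough to state in code. The sketch below (hypothetical function name, arbitrary bound `b = 2`) shows that multiplying an argument by its weight clips it to the interval $[-b, b]$, which is how the weight keeps the score bounded.

```python
def huber_weight(x, b):
    """Huber weight W_b(x) = min(1, b / |x|), with W_b(0) = 1."""
    return 1.0 if abs(x) <= b else b / abs(x)

b = 2.0
for x in (-8.0, -2.0, -0.5, 0.0, 1.0, 4.0):
    w = huber_weight(x, b)
    # x * W_b(x) equals x clipped to [-b, b].
    print(f"x = {x:5.1f}   W_b(x) = {w:.3f}   x * W_b(x) = {x * w:5.2f}")
```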

Finally, the matrix $\mathbf{B}$ should satisfy

$$E\left( \boldsymbol{\Psi}(y, \mathbf{x}, \boldsymbol{\beta}, b, \mathbf{B})\, \boldsymbol{\Psi}'(y, \mathbf{x}, \boldsymbol{\beta}, b, \mathbf{B}) \right) = \mathbf{B}.$$

Details of how to implement these estimators, in particular of how to estimate $\mathbf{B}$,
and a more precise description of their optimal properties can be found in Künsch
et al. (1989). We shall call these estimators optimal conditionally unbiased bounded
influence, or optimal CUBIF estimators for short.


7.4 Transformed M-estimators 


7.4.1 Definition of transformed M-estimators 


One source of difficulties for the development of robust estimators in the GLM is
that, unlike in the linear model, the variability of the observations depends on the
parameters, and this complicates assessing the outlyingness of an observation. There
are, however, cases of GLMs in which a “variance stabilizing transformation” exists,
and this fact greatly simplifies the problem. Let $f(y, \lambda)$ be a family of discrete or
continuous densities with a real parameter $\lambda$, and let $t : R \to R$ be a function such
that the variance of $t(y)$ is “almost” independent of $\lambda$. Given a $\rho$-function, a natural
way to define M-estimators of $\lambda$ is as follows. Let $m(\lambda)$ be defined by

$$m(\lambda) = \arg\min_{m} E_\lambda(\rho(t(y) - m)),$$

where $y$ has density $f(y, \lambda)$. Then, given a sample $y_1, \ldots, y_n$, we can define a
Fisher-consistent estimator of $\lambda$ by

$$\hat\lambda = \arg\min_{\lambda} \sum_{i=1}^{n} \rho(t(y_i) - m(\lambda)).$$

We call this estimator the transformed M-estimator (MT-estimator). The fact that the
variability of $t(y)$ is almost constant makes scale estimation unnecessary.

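Let $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ be a sample from a GLM such that $y \mid \mathbf{x}$ has density
$f(y, \lambda)$, and the link function is $l(\lambda) = \boldsymbol{\beta}'\mathbf{x}$. Then, Valdora and Yohai (2014) defined
the MT-estimator of $\boldsymbol{\beta}$ by

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\left( t(y_i) - m(l^{-1}(\boldsymbol{\beta}'\mathbf{x}_i)) \right). \tag{7.27}$$

The MT-estimators are not applicable when $f(y, \lambda)$ is the Bernoulli family of
distributions, since in this case these estimators coincide with the untransformed
M-estimator.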

Since $\rho$ is a bounded function, the estimator defined in (7.27) is already robust.
However, if high-leverage outliers are expected, penalizing high-leverage observations
may increase the estimator’s robustness. For this reason Valdora and Yohai
(2014) defined weighted MT-estimators (WMT-estimators) as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} w_i\, \rho\left( t(y_i) - m(l^{-1}(\boldsymbol{\beta}'\mathbf{x}_i)) \right), \tag{7.28}$$

where the weights $w_i$ are defined as in Section 7.2.1; that is, $w_i = W(h(\mathbf{x}_i))$ with $h$
defined as in (7.12).

Valdora and Yohai (2014) show that, under very general conditions, the
WMT-estimators and in particular the MT-estimators are consistent and asymptotically
normal. A family of $\rho$-functions satisfying these conditions is given by

$$\rho_c(u) = 1 - \left( 1 - \left( \frac{u}{c} \right)^2 \right)^4 I(|u| \le c), \tag{7.29}$$

where $c$ is a positive tuning constant that should be chosen by trading off
efficiency against robustness. Note the similarity with the bisquare function given
in (2.38). The reason to use this function is that it has three bounded derivatives,
as required for the asymptotic normality of the WMT-estimators. In contrast, the
bisquare $\rho$-function has only two. Generally, increasing $c$ gains efficiency
at the cost of robustness.

Consider in particular the case of Poisson regression with $l(\lambda) = \log(\lambda)$. It is
well known that a variance-stabilizing transformation for this family of distributions
is $t(y) = \sqrt{y}$. Monte Carlo simulations in Valdora and Yohai (2014) show that in
this case the MT-estimator, with $\rho$ in the family $\rho_c$ given in (7.29), and with $c = 2.3$,
compares favorably with other proposed robust estimators, such as optimal CUBIF
or the robust quasi-likelihood estimators defined in Section 7.4.3.
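
To see the definitions at work in the simplest setting, the sketch below computes the MT-estimator of a single Poisson mean (no regressors), with $t(y) = \sqrt{y}$ and $\rho_c$ from (7.29) with $c = 2.3$. It is an illustration under assumptions, not the algorithm of Valdora and Yohai or of any package: the optimizer (grid search plus golden-section refinement), the search interval for $\lambda$, the truncation of the Poisson sum and the sample itself are all choices made for this example.

```python
import math

def rho(u, c=2.3):
    """rho_c of (7.29): 1 - (1 - (u/c)^2)^4 for |u| <= c, and 1 otherwise."""
    return 1.0 - (1.0 - (u / c) ** 2) ** 4 if abs(u) <= c else 1.0

def argmin_scalar(f, lo, hi, grid=40, iters=30):
    """Coarse grid search followed by golden-section refinement on [lo, hi]."""
    step = (hi - lo) / grid
    best = min(range(grid + 1), key=lambda k: f(lo + k * step))
    a, b = max(lo, lo + (best - 1) * step), min(hi, lo + (best + 1) * step)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c1, c2 = b - g * (b - a), a + g * (b - a)
        if f(c1) < f(c2):
            b = c2
        else:
            a = c1
    return 0.5 * (a + b)

def m_of_lambda(lam):
    """m(lambda) = argmin_m E rho(sqrt(y) - m) for y ~ Poisson(lambda)."""
    ymax = int(lam + 10.0 * math.sqrt(lam) + 20.0)   # truncation point (assumption)
    pmf = [math.exp(-lam)]
    for y in range(1, ymax + 1):
        pmf.append(pmf[-1] * lam / y)
    roots = [math.sqrt(y) for y in range(ymax + 1)]
    return argmin_scalar(
        lambda m: sum(p * rho(r - m) for p, r in zip(pmf, roots)),
        0.0, math.sqrt(ymax))

def mt_location(ys, lam_lo=0.05, lam_hi=30.0):
    """MT-estimator of lambda: argmin of sum_i rho(sqrt(y_i) - m(lambda))."""
    roots = [math.sqrt(y) for y in ys]
    def objective(lam):
        m = m_of_lambda(lam)
        return sum(rho(r - m) for r in roots)
    return argmin_scalar(objective, lam_lo, lam_hi, grid=60, iters=30)

# Hypothetical counts: ten values near 4 and one gross outlier.
ys = [3, 4, 5, 4, 6, 5, 3, 4, 5, 4, 40]
lam_hat = mt_location(ys)
print("sample mean  =", sum(ys) / len(ys))
print("MT estimate  =", round(lam_hat, 3))
```

The outlier's transformed value lies more than $c$ away from $m(\lambda)$ for any $\lambda$ near the bulk, so it contributes the constant 1 to the objective and does not pull the estimate, whereas the sample mean is shifted upward.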


7.4.2 Some examples of variance-stabilizing transformations 


Consider a family of densities $f(y, \lambda)$ such that the corresponding means and variances
are $\mu(\lambda)$ and $v(\lambda)$ respectively. A simple Taylor expansion shows that

$$t(y) = \int^{y} \frac{du}{\sqrt{v(\mu^{-1}(u))}} \tag{7.30}$$

has approximately constant variance. From (7.30) we obtain that if $v(\lambda) = \mu(\lambda)^q$ then

$$t(y) = \begin{cases} y^{-(q/2)+1} & \text{if } q \ne 2 \\ \log(y) & \text{if } q = 2. \end{cases} \tag{7.31}$$

In the Poisson case, we have $\mu(\lambda) = v(\lambda) = \lambda$, and this yields $q = 1$ and

$$t(y) = \sqrt{y}.$$
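
The claim for the Poisson case can be checked directly by summing the probability function. The sketch below is a minimal numerical check (the function name and the truncation point of the sum are choices made for this example): while the variance of $y$ equals $\lambda$, the variance of $\sqrt{y}$ stays close to 1/4 across a wide range of $\lambda$.

```python
import math

def poisson_var_sqrt(lam):
    """Variance of sqrt(y) for y ~ Poisson(lam), by summing the pmf."""
    ymax = int(lam + 12.0 * math.sqrt(lam) + 30.0)   # truncation point (assumption)
    p = math.exp(-lam)          # P(y = 0)
    m1 = m2 = 0.0
    for y in range(ymax + 1):
        m1 += p * math.sqrt(y)  # accumulates E sqrt(y)
        m2 += p * y             # accumulates E y = E (sqrt(y))^2
        p *= lam / (y + 1)      # P(y + 1) from P(y)
    return m2 - m1 * m1

lams = (10.0, 50.0, 200.0)
variances = [poisson_var_sqrt(lam) for lam in lams]
for lam, v in zip(lams, variances):
    print(f"lambda = {lam:6.1f}   Var(y) = {lam:6.1f}   Var(sqrt(y)) = {v:.4f}")
```

The stabilization is only approximate and is poorer for small $\lambda$, which is consistent with the “almost” in the definition of the transformation.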


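In the case of an exponential distribution; that is, when

$$f(y, \lambda) = \lambda e^{-\lambda y} I(y > 0),$$

we have $\mu(\lambda) = 1/\lambda$ and $v(\lambda) = 1/\lambda^2$. Then $q = 2$ and $t(y) = \log(y)$.

In the case of a binomial distribution with $k$ trials and success probability $\lambda$
(Bi($\lambda$, $k$)); that is, when

$$f(y, \lambda) = \binom{k}{y} \lambda^{y} (1 - \lambda)^{k - y}, \quad 0 \le y \le k,$$

with $k$ known, we have $\mu(\lambda) = k\lambda$ and $v(\lambda) = k\lambda(1 - \lambda)$. Then in this case applying
(7.30) we get $t(y) = \arcsin \sqrt{y/k}$.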


7.4.3 Other estimators for GLMs 


Bianco et al. (2013) extended the redescending M-estimators of Section 7.2.2 to 
other GLMs. Bianco et al. (2005) considered M-estimators for the case when the 
distribution of y is gamma and the link function is the logarithm. They showed that 
in this case no correction term is needed for Fisher-consistency. 

Bergesio and Yohai (2011) extended the projection estimators of Section 5.11.2 to
GLMs. These estimators are highly robust but not very efficient. For this reason they
propose using a projection estimator followed by a one-step M-estimator. In this way
a highly robust and efficient estimator is obtained. A drawback of these estimators is
their computational complexity.

Cantoni and Ronchetti (2001) robustified the quasi-likelihood approach to estimate
GLMs. The quasi-likelihood estimators proposed by Wedderburn (1974) are
defined as solutions of the equation

$$\sum_{i=1}^{n} \frac{y_i - \mu(\boldsymbol{\beta}'\mathbf{x}_i)}{V(\boldsymbol{\beta}'\mathbf{x}_i)}\, \mu'(\boldsymbol{\beta}'\mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0},$$

where

$$\mu(\lambda) = E_\lambda(y), \qquad V(\lambda) = \mathrm{Var}_\lambda(y).$$

The robustification proposed by Cantoni and Ronchetti is performed by bounding
and centering the quasi-likelihood score function

$$\boldsymbol{\psi}(y_i, \boldsymbol{\beta}) = \frac{y_i - \mu(\boldsymbol{\beta}'\mathbf{x}_i)}{V(\boldsymbol{\beta}'\mathbf{x}_i)}\, \mu'(\boldsymbol{\beta}'\mathbf{x}_i)\, \mathbf{x}_i,$$

similar to what was done with the maximum likelihood score function in
Section 7.3.1. The purpose of centering is to obtain conditional Fisher-consistent
estimators and that of bounding is to bound the IF.

To cope with high-leverage points, they propose giving weights to each observation,
as in the definition of the weighted maximum likelihood estimators defined in
Section 7.2.1.


Example 7.3 Breslow (1996) used a Poisson GLM to study the effect of a drug in 
epilepsy patients using a sample of size 59. Tables and figures for this example are 
obtained with script epilepsy.R. 


The response variable is the number of attacks during four weeks (sumY) in
a given time interval and the explanatory variables are: patient age divided by 10
(Age10), the number of attacks in the four weeks previous to the study (Base4),
a dummy variable that takes values one or zero if the patient received the drug or a
placebo respectively (Trt) and an interaction term (Base4*Trt). We fit a Poisson GLM
with log link using five estimators: the MLE, the optimal CUBIF, a robustified
quasi-likelihood (RQL) estimator, the projection estimator followed by a one-step estimator
(MP) and the MT-estimator with $\rho_c$ given in (7.29) and $c = 2.3$. Figure 7.6 shows
boxplots of the absolute values of the respective deviance residuals. The left-hand plot
shows the residuals corresponding to all the observations. It is seen that all robust
estimators identify a large outlier (observation 49), while the MLE identifies none.
In order to make the boxes more clearly visible, the right-hand plot shows the boxplots
without the outliers. It is seen that the MT-estimator gives the best fit for the bulk of
the data.

Figure 7.6 Epilepsy data: boxplots of deviances


Table 7.3 Estimates for epilepsy data

Estimator     Intercept       Age10            Base4             Trt              Base4*Trt

MLE           1.84 (0.13)     0.24 (0.04)      0.09 (0.002)      −0.13 (0.04)     0.004 (0.002)
MLE_{-49}     1.60 (0.15)     0.29 (0.04)      0.10 (0.004)      −0.24 (0.05)     0.018 (0.004)
CUBIF         1.84 (0.69)     0.12 (0.12)      0.14 (0.24)       −0.41 (0.12)     0.022 (0.022)
MT            1.62 (0.26)     0.15 (0.091)     0.17 (0.013)      −0.60 (0.29)     0.042 (0.03)
MP            2.00            0.071            0.13              −0.49            0.0476
RQL           2.04 (0.15)     0.16 (0.047)     0.084 (0.004)     −0.33 (0.86)     0.012 (0.0049)


The coefficient estimates and their standard errors are shown in Table 7.3.
Figure 7.7 compares the ordered absolute deviances corresponding to the MT- and
ML estimators. It is seen that the MT-estimator gives a much better fit to all the data,
except for observation 49, which is pinpointed as an outlier.

Figure 7.7 Epilepsy data: ordered absolute deviances from the ML and MT-estimators


7.5 Recommendations and software 


For logistic regression, we recommend the redescending weighted M-estimator
defined in Section 7.2.2, which is implemented in logregWBY (RobStatTM).

Other options are:

• to use an M-estimator (Section 7.2.2) that can be computed with logregBY
  (RobStatTM);
• a weighted maximum likelihood estimator (Section 7.2.1) implemented in
  logregWML (RobStatTM);
• the conditional unbiased bounded influence estimators that can be computed with
  glmRob (robust) with the parameter “method” equal to “cubif”;
• the robust quasi-likelihood estimator (Section 7.4.3) that can be computed with
  glmrob (robustbase) with the parameter “method” equal to “Mqle”.

For Poisson regression we recommend the MT-estimator defined in Section 7.4,
computed with glmrob (robustbase) with the parameter “method” equal
to “MT”.

Another option is the robust quasi-likelihood estimator (Section 7.4.3), which
can be computed with glmrob (robustbase) with the parameter “method” equal
to “Mqle”.
7.6 Problems


7.1. The dataset neuralgia (Piegorsch, 1992) contains the values of four predictors
for 18 patients, the outcome being whether the patient experienced pain relief
after a treatment. The data are described in Chapter 11. Compare the fits given
by the different logistic regression estimators discussed in this chapter.


7.2. Prove (7.10). 


7.3. Consider the univariate logistic model without intercept for the sample
$(x_1, y_1), \ldots, (x_n, y_n)$ with $x_i \in R$; that is, $\boldsymbol{\beta}'\mathbf{x} = \beta x$. Let

$$p(x, \beta) = e^{\beta x} / (1 + e^{\beta x}) = P(y = 1).$$

(a) Show that

$$A_n(\beta) = \sum_{i=1}^{n} (y_i - p(x_i, \beta))\, x_i$$

is decreasing in $\beta$.

(b) Call $\hat\beta_n$ the ML estimator. Assume $\hat\beta_n > 0$. Add one outlier $(K, 0)$ where
$K > 0$; call $\hat\beta_{n+1}(K)$ the MLE computed with the enlarged sample. Show
that $\lim_{K \to \infty} \hat\beta_{n+1}(K) = 0$. State a similar result when $\hat\beta_n < 0$.


7.4. Let $Z_n = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ be a sample for the logistic model, where the
first coordinate of each $\mathbf{x}_i$ is 1 if the model contains an intercept. Consider a new
sample $Z_n^* = \{(-\mathbf{x}_1, 1 - y_1), \ldots, (-\mathbf{x}_n, 1 - y_n)\}$.

(a) Explain why it is desirable that an estimator $\hat{\boldsymbol{\beta}}$ satisfies the equivariance
property $\hat{\boldsymbol{\beta}}(Z_n^*) = \hat{\boldsymbol{\beta}}(Z_n)$.
(b) Show that M-, WM- and CUBIF estimators satisfy this property.


7.5. For the model in Problem 7.3, define an estimator by the equation

$$\sum_{i=1}^{n} (y_i - p(x_i, \beta))\, \mathrm{sgn}(x_i) = 0.$$

Since deleting all $x_i = 0$ yields the same estimate, it will be henceforth assumed
that $x_i \ne 0$ for all $i$.

(a) Show that this estimator is Fisher-consistent.

(b) Show that the estimator is a weighted ML estimator.

(c) Given the sample $Z_n = \{(x_i, y_i), i = 1, \ldots, n\}$, define the sample $Z_n^* =
\{(x_i^*, y_i^*), i = 1, \ldots, n\}$, where $(x_i^*, y_i^*) = (x_i, y_i)$ if $x_i > 0$ and $(x_i^*, y_i^*) =
(-x_i, 1 - y_i)$ if $x_i < 0$. Show that $\hat\beta_n(Z_n) = \hat\beta_n(Z_n^*)$.

(d) Show that $\hat\beta_n(Z_n^*)$ fulfills the equation $\sum_{i=1}^{n} y_i^* = \sum_{i=1}^{n} p(x_i^*, \beta)$, and hence
$\hat\beta_n(Z_n^*)$ is the value of $\beta$ that matches the empirical frequency of $y_i^* = 1$ with
the theoretical one.

(e) Prove that $\sum_{i} p(x_i^*, 0) = n/2$.

(f) Show that if $n$ is even, then the minimum number of observations that it is
necessary to change in order to make $\hat\beta_n = 0$ is $|n/2 - \sum_{i=1}^{n} y_i^*|$.

(g) Discuss the analogous property for odd $n$.

(h) Show that the influence function of this estimator is

$$\mathrm{IF}(y, x, \beta) = \frac{(y - p(x, \beta))\, \mathrm{sgn}(x)}{A},$$

where $A = E(p(x, \beta)(1 - p(x, \beta))|x|)$, and hence GES $= 1/A$.


7.6. Consider the CUBIF estimator defined in Section 7.3.

(a) Show that the correction term $c(a, b)$ defined above (7.26) is a solution of the
equation

$$E_{m^{-1}(a)}\left( \psi_b^{H}(y - g(a) - c(a, b)) \right) = 0.$$

(b) In the case of the logistic model for the Bernoulli family put $g(a) = e^{a}/(1 + e^{a})$.
Then prove that $c(a, b) = c^*(g(a), b)$, where

$$c^*(p, b) = \begin{cases} (1 - p)(p - b)/p & \text{if } p > \max\left( \tfrac{1}{2}, b \right) \\ p(b - 1 + p)/(1 - p) & \text{if } p < \min\left( \tfrac{1}{2}, 1 - b \right) \\ 0 & \text{elsewhere.} \end{cases}$$

(c) Show that the limit when $b \to 0$ of the CUBIF estimator for the model in
Problem 7.3 satisfies the equation

$$\sum_{i=1}^{n} \frac{y_i - p(x_i, \beta)}{\max(p(x_i, \beta), 1 - p(x_i, \beta))}\, \mathrm{sgn}(x_i) = 0.$$

Compare this estimator with the one of Problem 7.5.

(d) Show that the influence function of this estimator is

$$\mathrm{IF}(y, x, \beta) = \frac{1}{A}\, \frac{(y - p(x, \beta))\, \mathrm{sgn}(x)}{\max(p(x, \beta), 1 - p(x, \beta))}$$

with $A = E(\min(p(x, \beta), 1 - p(x, \beta))|x|)$; and that the gross error sensitivity
is GES$(\beta) = 1/A$.

(e) Show that this GES is smaller than the GES of the estimator given in
Problem 7.5. Explain why this may happen.


8 Time Series


Throughout this chapter we shall focus on time series in discrete time, that is, those whose time index $t$ is integer valued: $t = 0, \pm 1, \pm 2, \ldots$. We shall typically label the observed values of time series as $x_t$ or $y_t$, and so on.

We shall assume that our time series is either stationary in some sense or may be reduced to stationarity by a combination of elementary differencing operations and regression trend removal. Two types of stationarity are in common use, second-order stationarity and strict stationarity. The sequence is said to be second-order (or wide-sense) stationary if the first- and second-order moments $\mathrm{E}y_t$ and $\mathrm{E}(y_{t_1} y_{t_2})$ exist and are finite, with $\mathrm{E}y_t = \mu$ a constant independent of $t$, and the covariance of $y_{t+l}$ and $y_t$ depends only on the lag $l$:

$$\mathrm{Cov}(y_{t+l}, y_t) = C(l) \quad \text{for all } t, \tag{8.1}$$

where $C$ is called the covariance function, or alternatively the autocovariance function.

The time series is said to be strictly stationary if, for every integer $k \geq 1$ and every subset of times $t_1, t_2, \ldots, t_k$, the joint distribution of $y_{t_1}, y_{t_2}, \ldots, y_{t_k}$ is invariant with respect to shifts in time; that is, for every positive integer $k$ and every integer $l$ we have

$$D(y_{t_1+l}, \ldots, y_{t_k+l}) = D(y_{t_1}, \ldots, y_{t_k}),$$


where D denotes the joint distribution. A strictly stationary time series with finite 
second moments is obviously second-order stationary, and we shall assume unless 
otherwise stated that our time series is at least second-order stationary. 


Robust Statistics: Theory and Methods (with R), Second Edition. 

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matías Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd.
Companion website: www.wiley.com/go/maronna/robust

8.1 Time series outliers and their impact


Outliers in time series are more complex than in the situations dealt with in the pre- 
vious chapters, where there is no temporal dependence in the data. This is because in 
the time series setting we encounter several different types of outliers, as well as other 
important behaviors that are characterized by their temporal structure. Specifically, 
in fitting time series models we may have to deal with one or more of the following: 


• isolated outliers
• patchy outliers
• level shifts in mean value.


While level shifts have a different character than outliers, they are a frequently 
occurring phenomenon that must be dealt with in the context of robust model fitting, 
and so we include them in our discussion of robust methods for time series. The 
following figures display time series which exhibit each of these types of behavior. 
Figure 8.1 shows a time series of stock returns for a company, with stock ticker NHC, 
which contains an isolated outlier. Here we define stock returns $r_t$ as the relative change in price, $r_t = (p_t - p_{t-1})/p_{t-1}$.

Figure 8.2 shows a time series of stock prices (for a company with stock ticker 
WYE) which has a patch outlier of length four with roughly constant size. Patch 
outliers can have different shapes or “configurations”. For example, the stock returns 
for the company with ticker GHI in Figure 8.3 have a “doublet” patch outlier.

Figure 8.1 Stock returns (NHC) with isolated outlier

Figure 8.3 Stock returns (GHI) with doublet outlier

Figure 8.4 Stock prices with level shift


The doublet outlier in the GHI returns arises from the differencing operation: the isolated outlier in the GHI price series enters two consecutive returns computations.

Figure 8.4 shows a price time series for Dow Jones (ticker DOW), which has a 
large level shift at the beginning of October. Note that this level shift will produce an 
isolated outlier in the Dow Jones returns series. 

Finally, Figure 8.5 shows a time series of tobacco sales in the UK (West and 
Harrison, 1997), which contains both an isolated outlier and two or three level shifts. 
The series also appears to contain trend segments at the beginning and end of the 
series. It is important to note that since isolated outliers, patch outliers and level shifts 
can all occur in a single time series, it will not suffice to discuss robustness toward 
outliers without taking into consideration handling of patch outliers and level shifts. 
Note also that when one first encounters an outlier — that is, as the most recent obser- 
vation in a time series — then lacking side information we do not know whether it is 
an isolated outlier, or a level shift or a short patch outlier. Consequently it will take 
some amount of future data beyond the time of occurrence of the outlier in order to 
resolve this uncertainty. 


8.1.1 Simple examples of outlier influence


Time series outliers can have an arbitrarily adverse influence on parameter estimators for time series models, and the nature of this influence depends on the type of outlier.

Figure 8.5 Tobacco and related sales in the UK


We focus on the lag-$k$ autocorrelation

$$\rho(k) = \frac{\mathrm{Cov}(y_{t+k}, y_t)}{\mathrm{Var}(y_t)} = \frac{C(k)}{C(0)}. \tag{8.2}$$
Here we take a simple first look at the impact of time series outliers of different types by focusing on the special case of the estimation of $\rho(1)$. Let $y_1, y_2, \ldots, y_T$ be the observed values of the series. We initially assume for simplicity that $\mu = \mathrm{E}y_t = 0$. In that case, a natural estimator $\hat\rho(1)$ of $\rho(1)$ is given by the lag-1 sample autocorrelation coefficient

$$\hat\rho(1) = \frac{\sum_{t=1}^{T-1} y_t y_{t+1}}{\sum_{t=1}^{T} y_t^2}. \tag{8.3}$$

It may be shown that $|\hat\rho(1)| \leq 1$, which is certainly a reasonable property for such an estimator (see Problem 8.1).

Now suppose that for some $t_0$, the true value $y_{t_0}$ is replaced by an arbitrary value $A$, where $2 \leq t_0 \leq T - 1$. In this case the estimator becomes

$$\hat\rho(1) = \frac{\sum_{t=1}^{T-1} y_t y_{t+1} I(t \notin \{t_0 - 1, t_0\})}{\sum_{t=1}^{T} y_t^2 I(t \neq t_0) + A^2} + \frac{y_{t_0-1} A + A y_{t_0+1}}{\sum_{t=1}^{T} y_t^2 I(t \neq t_0) + A^2}.$$

Since $A$ appears quadratically in the denominator and only linearly in the numerator of the above estimator, $\hat\rho(1)$ goes to zero as $A \to \infty$, with all other values of $y_t$ for $t \neq t_0$ held fixed. So, whatever the original value of $\hat\rho(1)$, the alteration of an original value $y_{t_0}$ to an outlying value $y_{t_0} = A$ results in a “bias” of $\hat\rho(1)$ toward zero, the more so the larger the magnitude of the outlier.

Now consider the case of a patch outlier of constant value $A$ and patch length $k$, where the values $y_i$ for $i = t_0, \ldots, t_0 + k - 1$ are replaced by $A$. In this case the above estimator has the form

$$\hat\rho(1) = \frac{\sum_{t=1}^{T-1} y_t y_{t+1} I(t \notin [t_0 - 1, t_0 + k - 1]) + y_{t_0-1} A + (k - 1) A^2 + A y_{t_0+k}}{\sum_{t=1}^{T} y_t^2 I(t \notin [t_0, t_0 + k - 1]) + k A^2},$$

and therefore

$$\lim_{A \to \infty} \hat\rho(1) = \frac{k - 1}{k}.$$

Hence, the limiting value of $\hat\rho(1)$ with the patch outlier can either increase or decrease relative to the original value, depending on the value of $k$ and the value of $\hat\rho(1)$ without the patch outlier. For example, if $k = 10$ with $\hat\rho(1) = 0.5$ without the patch outlier, then $\hat\rho(1)$ increases to the value 0.9 as $A \to \infty$.

In some applications, one may find that outliers come in pairs of opposite signs. For example, when computing first differences of a time series that has isolated outliers, we get a doublet outlier, as shown in Figure 8.3. We leave it to the reader to show that for a doublet outlier with adjacent values having equal magnitude but opposite signs (that is, values $\pm A$), the limiting value of $\hat\rho(1)$ as $A \to \infty$ is $\hat\rho(1) = -0.5$ (Problem 8.2).
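
The three limiting values discussed in this section (zero for an isolated outlier, $(k-1)/k$ for a patch of length $k$, and $-0.5$ for a doublet) can be checked numerically. The following is only a minimal sketch: the simulated AR(1) series, the seed, the outlier position and the helper name `rho1_zero_mean` are illustrative assumptions, and a very large $A$ stands in for the limit $A \to \infty$.

```python
import numpy as np

def rho1_zero_mean(y):
    """Lag-1 sample autocorrelation of the zero-mean form (8.3)."""
    y = np.asarray(y, dtype=float)
    return np.sum(y[:-1] * y[1:]) / np.sum(y ** 2)

# Illustrative clean series (assumption): zero-mean AR(1) with phi = 0.5.
rng = np.random.default_rng(0)
T, phi = 200, 0.5
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.standard_normal()

A, t0, k = 1e6, 100, 10   # large outlier size, position, patch length

isolated = y.copy()
isolated[t0] = A                      # one value replaced by A

patch = y.copy()
patch[t0:t0 + k] = A                  # k consecutive values replaced by A

doublet = y.copy()
doublet[t0], doublet[t0 + 1] = A, -A  # adjacent values +A and -A

lim_isolated = rho1_zero_mean(isolated)
lim_patch = rho1_zero_mean(patch)
lim_doublet = rho1_zero_mean(doublet)
print(lim_isolated, lim_patch, lim_doublet)
```

The printed values should be close to 0, $(k-1)/k = 0.9$ and $-0.5$ respectively, whatever the autocorrelation of the clean series.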

Of course, one can seldom make the assumption that the time series has zero mean, so one usually defines the lag-1 sample autocorrelation coefficient using the definition

$$\hat\rho(1) = \frac{\sum_{t=1}^{T-1} (y_t - \bar y)(y_{t+1} - \bar y)}{\sum_{t=1}^{T} (y_t - \bar y)^2}. \tag{8.4}$$

Determining the influence of outliers in this more realistic case is often algebraically quite messy, but achievable. For example, in the case of an isolated outlier of size $A$, it may be shown that the limiting value of $\hat\rho(1)$ as $A \to \infty$ is approximately $-1/T$ for large $T$ (Problem 8.3). However, it is usually easier to resort to some type of influence function calculation, as described in Section 8.11.1.


8.1.2 Probability models for time series outliers 


In this section we describe several probability models for time series outliers, including additive outliers (AOs), replacement outliers (ROs) and innovations outliers (IOs). Let $x_t$ be a wide-sense stationary “core” process of interest, and let $v_t$ be a stationary outlier process that is non-zero a fraction $\varepsilon$ of the time; that is, $P(v_t = 0) = 1 - \varepsilon$. In practice the fraction $\varepsilon$ is often positive but small.

Under an AO model, instead of $x_t$ one actually observes

$$y_t = x_t + v_t \tag{8.5}$$

where the processes $x_t$ and $v_t$ are assumed to be independent of one another. A special case of the AO model was originally introduced by Fox (1972), who called them Type I outliers. Fox attributed such outliers to a “gross-error of observation or a recording error that affects a single observation”. The AO model will generate mostly isolated outliers if $v_t$ is an independent and identically distributed (i.i.d.) process, with standard deviation (or scale) much larger than that of $x_t$. For example, suppose that $x_t$ is a zero-mean normally distributed process with $\mathrm{Var}(x_t) = \sigma_x^2$, and $v_t$ has a normal mixture distribution with degenerate central component

$$v_t \sim (1 - \varepsilon)\delta_0 + \varepsilon N(\mu_v, \sigma_v^2). \tag{8.6}$$

Here $\delta_0$ is a point-mass distribution located at zero, and we assume that the normal component $N(\mu_v, \sigma_v^2)$ has variance $\sigma_v^2 \gg \sigma_x^2$. In this case $y_t$ will contain an outlier at any fixed time $t$ with probability $\varepsilon$, and the probability of getting two outliers in a row is the much smaller $\varepsilon^2$. It will be assumed that $\mu_v = 0$ unless otherwise stated.

Additive patch outliers can be obtained by specifying that at any given $t$, $v_t = 0$ with probability $1 - \varepsilon$; and with probability $\varepsilon$, $v_t$ is the first observation of a patch outlier having a particular structure that persists for $k$ time periods. We leave it for the reader (Problem 8.4) to construct a probability model to generate patch outliers.
RO models have the form 


$$y_t = (1 - z_t) x_t + z_t w_t \tag{8.7}$$

where $z_t$ is a 0–1 process with $P(z_t = 0) = 1 - \varepsilon$, and $w_t$ is a “replacement” process that is not necessarily independent of $x_t$. In fact, RO models contain AO models as a special case in which $w_t = x_t + v_t$ and $z_t$ is a Bernoulli process; that is, $z_t$ and $z_u$ are independent for $t \neq u$. Outliers that are mostly isolated are obtained, for example, when $z_t$ is a Bernoulli process and $x_t$ and $w_t$ are zero-mean normal processes with $\mathrm{Var}(w_t) = \sigma_w^2 \gg \sigma_x^2$. For the reader familiar with Markov chains, we can say that patch outliers may be obtained by letting $z_t$ be a Markov process that remains in the “one” state for more than one time period (of fixed or random duration), and $w_t$ has an appropriately specified probability model.

IOs are a highly specialized form of outlier that occur in linear processes such 
as AR, ARMA and ARIMA models, which will be discussed in subsequent sections. 
IO models were first introduced by Fox (1972), who termed them Type II outliers, 
and noted that an IO “will affect not only the current observation but also subsequent 
observations”. For the sake of simplicity, we illustrate IOs here in the special case of 
a first-order autoregression model, which is adequate to reveal the character of this 
type of outlier. A stationary first-order AR model is given by

$$x_t = \phi x_{t-1} + u_t \tag{8.8}$$

where the innovations process $u_t$ is i.i.d. with zero mean and finite variance, and $|\phi| < 1$. An IO is an outlier in the $u_t$ process. IOs are obtained, for example, when the innovations process has a zero-mean normal mixture distribution

$$(1 - \varepsilon) N(0, \sigma_0^2) + \varepsilon N(0, \sigma_1^2) \tag{8.9}$$

where $\sigma_1^2 \gg \sigma_0^2$. More generally, we may say that the process has IOs when the distribution of $u_t$ is heavy-tailed (e.g., a Student $t$-distribution).


Example 8.1. AR(1) with innovation outlier and additive outliers


To illustrate the differences between AOs and IOs, we display in the first row of Figure 8.6 a Gaussian first-order AR series $x_t$ of length 100, with parameter $\phi = 0.9$, and free of outliers. The second row shows the same series with ten additive outliers (marked with circles) obtained by adding the value 4 at ten equidistant positions. The third row displays the same series with one innovation outlier at position 50, also marked with a circle. This IO was created by replacing (only) the normal innovation $u_{50}$ by an atypical innovation with value $u_{50} = 10$. The persistent effect of the IO at $t = 50$ on subsequent observations is quite clear. The effect of this outlier decays roughly as $\phi^{t-50}$ for times $t > 50$.

One may think of an IO as an “impulse” input to a dynamic system driven by a background of uncorrelated or i.i.d. white noise. Consequently, immediately after the occurrence of the outlier, the output of the system behaves transitorily like an impulse response, a concept widely used in linear systems theory. It will be seen in Section 8.4.3 that IOs are “good” outliers, in the sense that they can improve the precision of the estimation of the parameters in AR and ARMA models, for example the parameter $\phi$ in the AR(1) model.

Figure 8.6 Top, a Gaussian AR(1) series with $\phi = 0.9$; middle, the same series with 10 additive outliers at equidistant locations; bottom, the same series with one innovation outlier at location 50
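
The geometric decay of an innovation outlier's effect can be made concrete with a small simulation. This is a sketch under stated assumptions, not the code behind Figure 8.6: the seed, the helper `ar1` and the start-up value `x[0] = u[0]` are illustrative choices. It compares an AR(1) path with and without a single replaced innovation and checks that their difference equals the replaced amount times $\phi^{t - t_{IO}}$.

```python
import numpy as np

def ar1(innov, phi):
    """Generate x_t = phi * x_{t-1} + u_t, started at x_0 = u_0 (a simplification)."""
    x = np.zeros(len(innov))
    x[0] = innov[0]
    for t in range(1, len(innov)):
        x[t] = phi * x[t - 1] + innov[t]
    return x

phi, T, t_io, size = 0.9, 100, 50, 10.0
rng = np.random.default_rng(1)
u = rng.standard_normal(T)

u_io = u.copy()
u_io[t_io] = size                 # replace only the innovation at index t_io

x_clean = ar1(u, phi)
x_io = ar1(u_io, phi)
effect = x_io - x_clean           # contribution of the IO alone

# By linearity the effect is zero before t_io and decays as phi**(t - t_io) afterwards.
expected = np.zeros(T)
expected[t_io:] = (size - u[t_io]) * phi ** np.arange(T - t_io)
print(np.max(np.abs(effect - expected)))
```

An additive outlier, by contrast, would change only the single observation at which it is added.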


8.1.3 Bias impact of AOs 


In this section we provide a simple illustrative example of the bias impact of AOs on 
the estimation of a first-order zero-mean AR model. 

The reader may easily check that the lag-1 autocorrelation coefficient $\rho(1)$ for the AR(1) model (8.8) is equal to $\phi$ (see Problem 8.6). Furthermore, a natural least-squares (LS) estimator $\hat\phi$ of $\phi$ in the case of perfect observations $y_t = x_t$ is obtained by solving the minimization problem

$$\min_{\phi} \sum_{t=2}^{T} (y_t - \phi y_{t-1})^2. \tag{8.10}$$

Differentiation with respect to $\phi$ gives the estimating equation

$$\sum_{t=2}^{T} y_{t-1}(y_t - \hat\phi y_{t-1}) = 0$$

and solving for $\hat\phi$ gives the LS estimator

$$\hat\phi = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2}. \tag{8.11}$$

A slightly different estimator is

$$\phi^* = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=1}^{T} y_t^2}, \tag{8.12}$$

which coincides with (8.3). The main difference between these two estimators is that $|\phi^*|$ is bounded by one, while $|\hat\phi|$ can take on values larger than one. Since the true autocorrelation coefficient $\phi$ has $|\phi| \leq 1$, and actually $|\phi| < 1$ except in the case of perfect linear dependence, the estimator $\phi^*$ is usually preferred.

Let $\rho_y(1)$ be the lag-1 autocorrelation coefficient for the AO observations $y_t = x_t + v_t$, where $x_t$ is given by (8.8). It may be shown that when $T \to \infty$, $\phi^*$ converges to $\rho_y(1)$ under mild regularity conditions, and the same is true of $\hat\phi$ (Brockwell and Davis, 1991). If we assume that $v_t$ is independent of $x_t$, and that $v_t$ has lag-1 autocorrelation coefficient $\rho_v(1)$, then

$$\rho_y(1) = \frac{\mathrm{Cov}(y_t, y_{t-1})}{\mathrm{Var}(y_t)} = \frac{\mathrm{Cov}(x_t, x_{t-1}) + \mathrm{Cov}(v_t, v_{t-1})}{\sigma_x^2 + \sigma_v^2} = \frac{\sigma_x^2 \phi + \sigma_v^2 \rho_v(1)}{\sigma_x^2 + \sigma_v^2}.$$

The large-sample bias of $\phi^*$ is

$$\mathrm{Bias}(\phi^*) = \rho_y(1) - \phi = \frac{\sigma_v^2}{\sigma_x^2 + \sigma_v^2}\left(\rho_v(1) - \phi\right) = \frac{R}{1 + R}\left(\rho_v(1) - \phi\right), \tag{8.13}$$

where $R = \sigma_v^2/\sigma_x^2$ is the “noise-to-signal” ratio. We see that the bias is zero when $R$ is zero; that is, when the AOs have zero variance. The bias is bounded in magnitude by $|\rho_v(1) - \phi|$ and approaches $|\rho_v(1) - \phi|$ in magnitude as $R$ approaches infinity. When the AOs have zero lag-1 autocorrelation, $\rho_v(1) = 0$, and $R$ is very large, the bias is approximately $-\phi$ and correspondingly the estimator $\phi^*$ has a value close to zero. As an intermediate example, suppose that $\rho_v(1) = 0$, $\phi = 0.5$, $\sigma_x^2 = 1$ and that $v_t$ has distribution (8.6) with $\varepsilon = 0.1$, $\mu_v = 0$ and $\sigma_v^2 = 0.9$. Then the bias is negative and equal to $-0.24$. On the other hand, if the values of $\rho_v(1)$ and $\phi$ are interchanged, with the other parameter values remaining fixed, then the bias is positive and has value $+0.24$.
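
The bias formula (8.13) is simple enough to evaluate directly. The sketch below is a hypothetical helper (`ao_bias` is not part of any package). It assumes the numerical example is read with noise-to-signal ratio $R = 0.9$, that is, with 0.9 taken as the variance of $v_t$ relative to $\sigma_x^2 = 1$; under that reading it reproduces the values $\mp 0.24$ quoted above.

```python
def ao_bias(phi, rho_v, R):
    """Large-sample bias (8.13) of phi* under additive outliers.

    phi   : true AR(1) parameter
    rho_v : lag-1 autocorrelation of the outlier process v_t
    R     : noise-to-signal ratio Var(v_t) / Var(x_t)
    """
    return R / (1.0 + R) * (rho_v - phi)

# Assumed reading of the intermediate example: R = 0.9.
bias_uncorrelated_ao = ao_bias(phi=0.5, rho_v=0.0, R=0.9)
bias_interchanged = ao_bias(phi=0.0, rho_v=0.5, R=0.9)

# For very large R the bias approaches rho_v - phi = -phi when rho_v = 0.
bias_large_R = ao_bias(phi=0.5, rho_v=0.0, R=1e9)

print(round(bias_uncorrelated_ao, 2), round(bias_interchanged, 2), bias_large_R)
```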


8.2 Classical estimators for AR models 


In this section we describe the properties of classical estimators of the parameters 
of an autoregression model. In particular we describe the form of these estimators, 
state the form of their limiting multivariate normal distribution, and describe their 
efficiency and robustness in the absence of outliers. 

An autoregression model of order $p$, called an AR($p$) model for short, generates a time series according to the stochastic difference equation

$$y_t = \gamma + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + u_t \tag{8.14}$$

where the innovations $u_t$ are an i.i.d. sequence of random variables with mean 0 and finite variance $\sigma_u^2$. The innovations are assumed to be independent of past values of the $y_t$'s. It is known that the time series $y_t$ is stationary if all the roots of the characteristic polynomial

$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p \tag{8.15}$$

lie outside the unit circle in the complex plane, and the sum of the $\phi_i$ is less than one. When $y_t$ is stationary, it has a mean value $\mu = \mathrm{E}(y_t)$ that is determined by taking the mean value of both sides of (8.14), giving

$$\mu = \gamma + \phi_1 \mu + \phi_2 \mu + \cdots + \phi_p \mu + 0,$$

which implies

$$\left(1 - \sum_{i=1}^{p} \phi_i\right)\mu = \gamma \tag{8.16}$$

and hence

$$\mu = \frac{\gamma}{1 - \sum_{i=1}^{p} \phi_i}. \tag{8.17}$$

In view of (8.16), the AR($p$) model may also be written in the form

$$y_t - \mu = \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + u_t. \tag{8.18}$$
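
The root condition on the characteristic polynomial (8.15) and the mean formula (8.17) are easy to check numerically. The following sketch uses `numpy.roots`; the function names `ar_is_stationary` and `ar_mean` and the example coefficients are illustrative assumptions.

```python
import numpy as np

def ar_is_stationary(phi):
    """True if all roots of 1 - phi_1 z - ... - phi_p z^p lie outside the unit circle."""
    phi = np.asarray(phi, dtype=float)
    # np.roots expects coefficients ordered from the highest power of z to the constant.
    coeffs = np.concatenate((-phi[::-1], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

def ar_mean(gamma, phi):
    """Mean of a stationary AR(p) process from (8.17)."""
    return gamma / (1.0 - np.sum(phi))

stationary_ar2 = ar_is_stationary([0.5, 0.3])   # roots near 1.17 and -2.84
explosive_ar1 = ar_is_stationary([1.1])         # root 1/1.1 lies inside the unit circle
mu = ar_mean(2.0, [0.5, 0.3])                   # 2 / (1 - 0.8)
print(stationary_ar2, explosive_ar1, mu)
```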


There are several asymptotically equivalent forms of LS estimators of the AR($p$) model parameters that are asymptotically efficient when the distribution of $u_t$ is normal. Given a sample of observations $y_1, y_2, \ldots, y_T$, it seems natural at first glance to compute LS estimators of the parameters by choosing $\gamma, \phi_1, \phi_2, \ldots, \phi_p$ to minimize the sum of squares:

$$\sum_{t=p+1}^{T} \hat u_t^2 \tag{8.19}$$

where $\hat u_t$ are the prediction residuals defined by

$$\hat u_t = \hat u_t(\phi, \gamma) = y_t - \gamma - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p}. \tag{8.20}$$

This is equivalent to applying ordinary LS to the linear regression model

$$y = G\beta + u \tag{8.21}$$

where

$$y' = (y_{p+1}, y_{p+2}, \ldots, y_T), \quad u' = (u_{p+1}, u_{p+2}, \ldots, u_T), \quad \beta' = (\phi', \gamma) = (\phi_1, \phi_2, \ldots, \phi_p, \gamma) \tag{8.22}$$

and

$$G = \begin{pmatrix} y_p & y_{p-1} & \cdots & y_1 & 1 \\ y_{p+1} & y_p & \cdots & y_2 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ y_{T-1} & y_{T-2} & \cdots & y_{T-p} & 1 \end{pmatrix}. \tag{8.23}$$

This form of LS estimator may also be written as

$$\hat\beta = (\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_p, \hat\gamma)' = (G'G)^{-1} G' y \tag{8.24}$$

and the mean value estimator can be computed as

$$\hat\mu = \frac{\hat\gamma}{1 - \sum_{i=1}^{p} \hat\phi_i}. \tag{8.25}$$

An alternative approach is to estimate $\mu$ by the sample mean $\bar y$ and compute the LS estimator of $\phi$ as

$$\hat\phi = (G^{*\prime} G^*)^{-1} G^{*\prime} y^*, \tag{8.26}$$

where $y^*$ is now the vector of centered observations $y_t - \bar y$, and $G^*$ is defined as in (8.23), but replacing the $y_t$'s by the $y_t^*$ and omitting the last column of ones.
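
The regression form (8.21)–(8.25) translates almost line by line into code. The sketch below is an illustration under assumptions, not the RobStatTM implementation: the function name `ar_ls`, the simulated AR(2) series and the seed are hypothetical. It builds the matrix $G$ of (8.23) and solves the LS problem with `numpy.linalg.lstsq`.

```python
import numpy as np

def ar_ls(y, p):
    """LS fit of an AR(p) model with intercept, following (8.21)-(8.25)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    # Row for time t holds (y_{t-1}, ..., y_{t-p}, 1), t = p+1, ..., T.
    lags = [y[p - i: T - i] for i in range(1, p + 1)]
    G = np.column_stack(lags + [np.ones(T - p)])
    beta, *_ = np.linalg.lstsq(G, y[p:], rcond=None)
    phi_hat, gamma_hat = beta[:p], beta[p]
    mu_hat = gamma_hat / (1.0 - phi_hat.sum())     # (8.25)
    return phi_hat, gamma_hat, mu_hat

# Illustrative data (assumption): AR(2) with phi = (0.5, 0.3), gamma = 2, so mu = 10.
rng = np.random.default_rng(2)
T, burn = 5000, 200
phi_true, gamma_true = np.array([0.5, 0.3]), 2.0
y = np.full(T + burn, 10.0)
for t in range(2, T + burn):
    y[t] = gamma_true + phi_true[0] * y[t - 1] + phi_true[1] * y[t - 2] + rng.standard_normal()
y = y[burn:]                                        # drop the start-up segment

phi_hat, gamma_hat, mu_hat = ar_ls(y, p=2)
print(phi_hat, gamma_hat, mu_hat)
```

With clean Gaussian data the estimates should be close to the true values; the same code applied to a series with additive outliers would show the bias discussed in Section 8.1.3.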
Unfortunately, the above forms of the LS estimator do not ensure that the estimators $\hat\phi = (\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_p)'$ correspond to a stationary autoregression; that is, it can happen that one or more of the roots of the estimated characteristic polynomial

$$\hat\phi(z) = 1 - \hat\phi_1 z - \hat\phi_2 z^2 - \cdots - \hat\phi_p z^p$$


lie inside the unit circle. A common way around this is to use the so-called Yule–Walker equations to estimate $\phi$. Let $C(l)$ be the autocovariance (8.1) at lag $l$. The Yule–Walker equations relating the autocovariances and the parameters of an AR($p$) process are obtained from (8.18) as

$$C(k) = \sum_{i=1}^{p} \phi_i C(k - i) \quad (k \geq 1). \tag{8.27}$$

For $1 \leq k \leq p$, (8.27) may be expressed in matrix equation form as

$$C\phi = g \tag{8.28}$$

where $g' = (C(1), C(2), \ldots, C(p))$ and the $p \times p$ matrix $C$ has elements $C_{ij} = C(i - j)$. It is left for the reader (Problem 8.5) to verify the above equations for an AR($p$) model.

The Yule–Walker equations can also be written in terms of the autocorrelations as

$$\rho(k) = \sum_{i=1}^{p} \phi_i \rho(k - i) \quad (k \geq 1). \tag{8.29}$$

The Yule–Walker estimator $\hat\phi_{YW}$ of $\phi$ is obtained by replacing the unknown lag-$l$ covariances in $C$ and $g$ of (8.28) by the covariance estimators

$$\hat C(l) = \frac{1}{T} \sum_{t=1}^{T-|l|} (y_{t+|l|} - \bar y)(y_t - \bar y) \tag{8.30}$$

and solving for $\hat\phi_{YW}$. It can be shown that the above lag-$l$ covariance estimators are biased and that unbiased estimators can be obtained under normality by replacing the denominator $T$ by $T - |l| - 1$. However, the covariance matrix estimator $\hat C$ based on the above biased lag-$l$ estimators is preferred since it is known to be positive definite (with probability 1), and furthermore the resulting Yule–Walker parameter estimator $\hat\phi_{YW}$ corresponds to a stationary autoregression (see, for example, Brockwell and Davis, 1991).
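
A minimal sketch of the Yule–Walker estimator follows, assuming the biased covariance estimators (8.30) with divisor $T$. The helper names `autocov` and `yule_walker`, the simulated AR(2) series and the seed are illustrative assumptions. The final lines check that the fitted characteristic polynomial has all its roots outside the unit circle, as the text states it should.

```python
import numpy as np

def autocov(y, max_lag):
    """Biased sample autocovariances (8.30): divisor T at every lag."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    return np.array([np.sum(d[l:] * d[:T - l]) / T for l in range(max_lag + 1)])

def yule_walker(y, p):
    """Solve the sample version of C phi = g in (8.28)."""
    c = autocov(y, p)
    C = np.array([[c[abs(i - j)] for j in range(p)] for i in range(p)])
    g = c[1:p + 1]
    return np.linalg.solve(C, g)

# Illustrative data (assumption): AR(2) with phi = (0.5, 0.3) and mean 10.
rng = np.random.default_rng(3)
T, burn = 5000, 200
y = np.full(T + burn, 10.0)
for t in range(2, T + burn):
    y[t] = 2.0 + 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()
y = y[burn:]

phi_yw = yule_walker(y, p=2)
roots = np.roots(np.concatenate((-phi_yw[::-1], [1.0])))
print(phi_yw, np.abs(roots))
```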

The Durbin–Levinson algorithm, to be described in Section 8.2.1, is a convenient recursive procedure to compute the sequence of Yule–Walker estimators for AR(1), AR(2), and so on.

8.2.1 The Durbin–Levinson algorithm


We shall describe the Durbin–Levinson algorithm, which is a recursive method to derive the best memory-$m$ linear predictor from the best memory-$(m - 1)$ predictor.
Let $y_t$ be a second-order stationary process. It will be assumed that $\mathrm{E}y_t = 0$. Otherwise, we apply the procedure to the centered process $y_t - \mathrm{E}y_t$ rather than $y_t$.
Let

$$\hat y_{t,m} = \phi_{m,1} y_{t-1} + \cdots + \phi_{m,m} y_{t-m} \tag{8.31}$$

be the minimum mean-square error (MMSE) memory-$m$ linear predictor of $y_t$ based on $y_{t-1}, \ldots, y_{t-m}$. The “diagonal” coefficients $\phi_{m,m}$ are the so-called partial autocorrelations, which are very useful for the identification of AR models, as will be seen in Section 8.7.
The $\phi_{m,m}$ satisfy

$$|\phi_{m,m}| < 1, \tag{8.32}$$

except when the process is deterministic.
The MMSE forward prediction residuals are defined as

$$\hat u_{t,m} = y_t - \phi_{m,1} y_{t-1} - \cdots - \phi_{m,m} y_{t-m}. \tag{8.33}$$

The memory-$m$ backward linear predictor of $y_t$ (that is, the MMSE predictor of $y_t$ as a linear function of $y_{t+1}, \ldots, y_{t+m}$) can be shown to be

$$\hat y_{t,m}^* = \phi_{m,1} y_{t+1} + \cdots + \phi_{m,m} y_{t+m},$$

and the backward MMSE prediction residuals are

$$\hat u_{t,m}^* = y_t - \phi_{m,1} y_{t+1} - \cdots - \phi_{m,m} y_{t+m}. \tag{8.34}$$

Note that $\hat u_{t,m-1}$ and $\hat u_{t-m,m-1}^*$ are both orthogonal to the linear space spanned by $y_{t-1}, \ldots, y_{t-m+1}$, with respect to the expectation inner product; that is,

$$\mathrm{E}\,\hat u_{t,m-1} y_{t-k} = \mathrm{E}\,\hat u_{t-m,m-1}^* y_{t-k} = 0, \quad k = 1, \ldots, m - 1.$$

We shall first derive the form of the memory-$m$ predictor assuming that the true memory-$(m - 1)$ predictor and the true values of the autocorrelations $\rho(k)$, $k = 1, \ldots, m$, are known.

Let $\zeta^* \hat u_{t-m,m-1}^*$ be the MMSE linear predictor of $\hat u_{t,m-1}$ based on $\hat u_{t-m,m-1}^*$. Then

$$\mathrm{E}\left(\hat u_{t,m-1} - \zeta^* \hat u_{t-m,m-1}^*\right)^2 = \min_{\zeta} \mathrm{E}\left(\hat u_{t,m-1} - \zeta \hat u_{t-m,m-1}^*\right)^2. \tag{8.35}$$

It can be proved that the MMSE memory-$m$ predictor is given by

$$\hat y_{t,m} = \hat y_{t,m-1} + \zeta^* \hat u_{t-m,m-1}^* = (\phi_{m-1,1} - \zeta^* \phi_{m-1,m-1}) y_{t-1} + \cdots + (\phi_{m-1,m-1} - \zeta^* \phi_{m-1,1}) y_{t-m+1} + \zeta^* y_{t-m}. \tag{8.36}$$

To show (8.36) it suffices to prove that

$$\mathrm{E}\left[\left(y_t - \hat y_{t,m-1} - \zeta^* \hat u_{t-m,m-1}^*\right) y_{t-i}\right] = 0, \quad i = 1, \ldots, m, \tag{8.37}$$

which we leave to the reader as Problem 8.8. Then from (8.36) we have

$$\phi_{m,i} = \begin{cases} \zeta^* & \text{if } i = m \\ \phi_{m-1,i} - \zeta^* \phi_{m-1,m-i} & \text{if } 1 \leq i \leq m - 1. \end{cases} \tag{8.38}$$

According to (8.38), if we already know $\phi_{m-1,i}$, $1 \leq i \leq m - 1$, then to compute all the $\phi_{m,i}$, we only need $\zeta^* = \phi_{m,m}$.
It is easy to show that

$$\phi_{m,m} = \mathrm{Corr}\left(\hat u_{t,m-1}, \hat u_{t-m,m-1}^*\right) = \frac{\rho(m) - \sum_{i=1}^{m-1} \rho(m - i) \phi_{m-1,i}}{1 - \sum_{i=1}^{m-1} \rho(i) \phi_{m-1,i}}. \tag{8.39}$$

The first equality above justifies the term “partial autocorrelation”: it is the correlation between $y_t$ and $y_{t-m}$ after the linear contribution of $y_i$ ($i = t - 1, \ldots, t - m + 1$) has been subtracted out.
If $y_t$ is a stationary AR($p$) process with parameters $\phi_1, \ldots, \phi_p$, it may be shown (Problem 8.11) that

$$\phi_{p,i} = \phi_i, \quad 1 \leq i \leq p, \tag{8.40}$$

and

$$\phi_{m,m} = 0, \quad m > p. \tag{8.41}$$


In the case that we have only a sample from a process, the unknown autocorrelations $\rho(k)$ can be estimated by their empirical counterparts, and the Durbin–Levinson algorithm can be used to estimate the predictor coefficients $\phi_{m,i}$ in a recursive way. In particular, if the process is assumed to be AR($p$), then the AR coefficient estimators are obtained by substituting estimators in (8.40).


It is easy to show that $\phi_{1,1} = \rho(1)$ or, equivalently,

$$\phi_{1,1} = \arg\min_{\zeta} \mathrm{E}(y_t - \zeta y_{t-1})^2. \tag{8.42}$$


We shall now describe the classical Durbin–Levinson algorithm in such a way as to clarify the basis for its robust version, as given in Section 8.6.4.
The first step is to set $\hat\phi_{1,1} = \hat\rho(1)$, which is equivalent to

$$\hat\phi_{1,1} = \arg\min_{\zeta} \sum_{t=2}^{T} (y_t - \zeta y_{t-1})^2. \tag{8.43}$$

Assuming that estimators $\hat\phi_{m-1,i}$ of $\phi_{m-1,i}$, for $1 \leq i \leq m - 1$, have already been computed, $\hat\phi_{m,m}$ can be computed from (8.39), where the $\rho$s and $\phi$s are replaced by their estimators. Alternatively, $\hat\phi_{m,m}$ is obtained as

$$\hat\phi_{m,m} = \arg\min_{\zeta} \sum_{t=m+1}^{T} \hat u_{t,m}^2(\zeta), \tag{8.44}$$

with

$$\hat u_{t,m}(\zeta) = y_t - \hat y_{t,m-1} - \zeta \hat u_{t-m,m-1}^* = y_t - (\hat\phi_{m-1,1} - \zeta \hat\phi_{m-1,m-1}) y_{t-1} - \cdots - (\hat\phi_{m-1,m-1} - \zeta \hat\phi_{m-1,1}) y_{t-m+1} - \zeta y_{t-m} \tag{8.45}$$

and where the backward residuals $\hat u_{t-m,m-1}^*$ are computed here by

$$\hat u_{t-m,m-1}^* = y_{t-m} - \hat\phi_{m-1,1} y_{t-m+1} - \cdots - \hat\phi_{m-1,m-1} y_{t-1}.$$

The remaining $\hat\phi_{m,i}$ are computed using the recursion (8.38). It is easy to verify that this sample Durbin–Levinson method is essentially equivalent to obtaining the AR($m$) estimator $\hat\phi_m$ by solving the Yule–Walker equations for $m = 1, 2, \ldots, p$.
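
The recursion defined by (8.38) and (8.39) can be written compactly. The sketch below is an illustration; the function name `durbin_levinson` and the AR(2) test case are assumptions. It takes autocorrelations $\rho(0), \ldots, \rho(p)$ and returns the coefficient vectors $(\phi_{m,1}, \ldots, \phi_{m,m})$ for $m = 1, \ldots, p$. Fed with the exact autocorrelations of an AR(2) process, it should return the AR coefficients at $m = 2$, as in (8.40), and a zero partial autocorrelation at $m = 3$, as in (8.41). Passing sample autocorrelations instead gives the sample version described above.

```python
import numpy as np

def durbin_levinson(rho, p):
    """Durbin-Levinson recursion.

    rho : sequence with rho[k] = autocorrelation at lag k, rho[0] = 1
    Returns a list whose (m-1)-th entry is (phi_{m,1}, ..., phi_{m,m}).
    """
    rho = np.asarray(rho, dtype=float)
    prev = np.array([rho[1]])                  # phi_{1,1} = rho(1)
    out = [prev]
    for m in range(2, p + 1):
        # (8.39): partial autocorrelation phi_{m,m}
        num = rho[m] - np.sum(rho[m - 1:0:-1] * prev)
        den = 1.0 - np.sum(rho[1:m] * prev)
        zeta = num / den
        # (8.38): update the remaining coefficients, then append phi_{m,m}
        cur = np.append(prev - zeta * prev[::-1], zeta)
        out.append(cur)
        prev = cur
    return out

# Exact autocorrelations of an AR(2) with phi = (0.5, 0.3), from (8.29).
phi1, phi2 = 0.5, 0.3
rho = [1.0, phi1 / (1.0 - phi2)]
for _ in range(2):
    rho.append(phi1 * rho[-1] + phi2 * rho[-2])

coefs = durbin_levinson(rho, p=3)
print(coefs[1], coefs[2])
```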


8.2.2 Asymptotic distribution of classical estimators 


The LS and Yule–Walker estimators have the same asymptotic distribution, which will be studied in this section. Call $\hat\lambda = (\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_p, \hat\mu)$ the LS or Yule–Walker estimator of $\lambda = (\phi_1, \phi_2, \ldots, \phi_p, \mu)$ based on a sample of size $T$. Here $\hat\mu$ can be either the sample mean estimator or the estimator of (8.25) based on the LS estimator $\hat\beta$ defined in (8.19). It is known that $\sqrt{T}(\hat\lambda - \lambda)$ converges in distribution to a $(p + 1)$-dimensional multivariate normal distribution

$$\sqrt{T}(\hat\lambda - \lambda) \to_d N_{p+1}(0, V_{LS}) \tag{8.46}$$

where the asymptotic covariance matrix $V_{LS}$ is given by

$$V_{LS} = \begin{pmatrix} V_{LS,\phi} & 0' \\ 0 & V_{LS,\mu} \end{pmatrix} \tag{8.47}$$

with

$$V_{LS,\mu} = \frac{\sigma_u^2}{\left(1 - \sum_{i=1}^{p} \phi_i\right)^2} \tag{8.48}$$

and

$$V_{LS,\phi} = V_{LS}(\phi) = \sigma_u^2 C^{-1}, \tag{8.49}$$

where $C$ is the $p \times p$ covariance matrix of $(y_{t-1}, \ldots, y_{t-p})$, which does not depend on $t$ (due to the stationarity of $y_t$), and $\sigma_u^2 C^{-1}$ depends only on the AR parameters $\phi$; see, for example, Anderson (1994) or Brockwell and Davis (1991).
In Section 8.15 we give a heuristic derivation of this result and the expression for
= Cle. 


Remark: Note that if we apply formula (5.6) for the asymptotic covariance matrix 
of the LS estimator under a linear model with random predictors, to the regression 
model (8.21), then the result coincides with (8.49). 


The block-diagonal structure of $V_{LS}$ shows that $\hat\mu$ and $\hat\phi = (\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_p)'$ are asymptotically uncorrelated. The standard estimator of the innovations variance $\sigma_u^2$ is

or = > 4 —P = Din — Badia = = bpp)” (8.50) 


Tp P zp 1 


or alternatively 


6, = > Oy — 1-1 — DaVp-2 — -- — Bp)” (8.51) 
T=p, t=pt+l 
where $\tilde y_{t-i} = y_{t-i} - \hat\mu$, $i = 0, 1, \ldots, p$. It is known that $\hat\sigma_u^2$ is asymptotically uncorrelated with $\hat\lambda$ and has asymptotic variance

$$\mathrm{AsVar}(\hat\sigma_u^2) = \mathrm{E}(u^4) - \sigma_u^4. \tag{8.52}$$

In the case of normally distributed $u_t$, this expression reduces to $\mathrm{AsVar}(\hat\sigma_u^2) = 2\sigma_u^4$.

What is particularly striking about the asymptotic covariance matrix $V_{LS}$ is that the $p \times p$ submatrix $V_{LS,\phi}$ is a constant that depends only on $\phi$, and not at all on the distribution $F_u$ of the innovations $u_t$ (assuming finite variance innovations!). This distribution-free character of the estimator led Whittle (1962) to use the term robust to describe the LS estimators of AR parameters. With hindsight, this was a rather misleading use of this term because the constant character of $V_{LS,\phi}$ holds only under perfectly observed autoregressions; that is, with no AOs or ROs. Furthermore it turns out that the LS estimator will not be efficiency robust with respect to heavy-tailed deviations of the IOs from normality, as we discuss in Section 8.4. It should also be noted that the variance $V_{LS,\mu}$ is not constant with respect to changes in the variance of the innovations, and $\mathrm{AsVar}(\hat\sigma_u^2)$ depends upon the fourth moment as well as the variance of the innovations.
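
As a worked illustration of (8.48) and (8.49), consider the AR(1) case $p = 1$ with $\phi_1 = \phi$. The numbers $\phi = 0.9$ and $T = 100$ are chosen here only to match Example 8.1. The stationary variance is $C = \mathrm{Var}(y_{t-1}) = \sigma_u^2/(1 - \phi^2)$, so $\sigma_u^2$ cancels in $V_{LS,\phi}$ but not in $V_{LS,\mu}$:

```latex
\begin{align*}
V_{LS,\phi} &= \sigma_u^2 C^{-1}
             = \sigma_u^2 \cdot \frac{1 - \phi^2}{\sigma_u^2}
             = 1 - \phi^2, \\
V_{LS,\mu}  &= \frac{\sigma_u^2}{(1 - \phi)^2}. \\
\intertext{For $\phi = 0.9$ and $T = 100$ this gives}
V_{LS,\phi} &= 1 - 0.81 = 0.19, \qquad
\sqrt{V_{LS,\phi}/T} = \sqrt{0.0019} \approx 0.044, \\
V_{LS,\mu}  &= \frac{\sigma_u^2}{0.01} = 100\,\sigma_u^2 .
\end{align*}
```

The approximate standard error of $\hat\phi$ is therefore about 0.044 whatever the innovations distribution (with finite variance), while the precision of $\hat\mu$ scales with $\sigma_u^2$.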


8.3 Classical estimators for ARMA models


A time series $y_t$ is called an autoregressive moving-average model of orders $p$ and $q$, or ARMA($p$, $q$) for short, if it obeys the stochastic difference equation

$$(y_t - \mu) - \phi_1(y_{t-1} - \mu) - \cdots - \phi_p(y_{t-p} - \mu) = -\theta_1 u_{t-1} - \cdots - \theta_q u_{t-q} + u_t \tag{8.53}$$

where the i.i.d. innovations $u_t$ have mean 0 and finite variance $\sigma_u^2$. This equation may be written in more compact form as

$$\phi(B)(y_t - \mu) = \theta(B) u_t \tag{8.54}$$

where $B$ is the back-shift operator; that is, $B y_t = y_{t-1}$, and $\phi(B)$ and $\theta(B)$ are polynomial back-shift operators given by

$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p \tag{8.55}$$

and

$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q. \tag{8.56}$$

The process is called invertible if $y_t$ can be expressed as an infinite linear combination of the $y_s$ for $s < t$ plus the innovations:

$$y_t = u_t + \sum_{i=1}^{\infty} \eta_i y_{t-i} + \gamma.$$

It will henceforth be assumed that the ARMA process is stationary and invertible. The first assumption requires that all roots of the polynomial $\phi(B)$ lie outside the unit circle and the second requires the same of the roots of $\theta(B)$.
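Let $\lambda = (\phi, \theta, \mu) = (\phi_1, \phi_2, \ldots, \phi_p, \theta_1, \theta_2, \ldots, \theta_q, \mu)$ and consider the sum of squared residuals

$$\sum_{t=p+1}^{T} \hat u_t^2(\lambda) \tag{8.57}$$

where the residuals $\hat u_t(\lambda)$ may be computed recursively as

$$\hat u_t(\lambda) = (y_t - \mu) - \phi_1(y_{t-1} - \mu) - \cdots - \phi_p(y_{t-p} - \mu) + \theta_1 \hat u_{t-1}(\lambda) + \cdots + \theta_q \hat u_{t-q}(\lambda) \tag{8.58}$$

with the initial conditions

$$\hat u_p(\lambda) = \hat u_{p-1}(\lambda) = \cdots = \hat u_{p-q+1}(\lambda) = 0. \tag{8.59}$$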

Minimizing the sum of squared residuals (8.57) with respect to $\lambda$ produces an LS estimator $\hat\lambda_{LS} = (\hat\phi, \hat\theta, \hat\mu)$. When the innovations $u_t$ have a normal distribution with mean 0 and finite variance $\sigma_u^2$, this LS estimator is a conditional maximum likelihood estimator, conditioned on $y_1, y_2, \ldots, y_p$ and on

$$u_{p-q+1} = u_{p-q+2} = \cdots = u_{p-1} = u_p = 0.$$


See for example Harvey and Philips (1979), where it is also shown how to compute the exact Gaussian maximum likelihood estimator of ARMA model parameters.

It is known that under the above assumptions for the ARMA($p$, $q$) process, the LS estimator, as well as the conditional and exact maximum likelihood estimators, converge asymptotically to a multivariate normal distribution:

$$\sqrt{T}(\hat\lambda_{LS} - \lambda) \to_d N_{p+q+1}(0, V_{LS}) \tag{8.60}$$

where

$$V_{LS} = V_{LS}(\phi, \theta, \sigma_u^2) = \begin{pmatrix} D^{-1}(\phi, \theta) & 0' \\ 0 & V_{LS,\mu} \end{pmatrix} \tag{8.61}$$

with $V_{LS,\mu}$ the asymptotic variance of the location estimator $\hat\mu$ and $D(\phi, \theta)$ the $(p + q) \times (p + q)$ asymptotic covariance matrix of $(\hat\phi, \hat\theta)$. Expressions for the elements of $D(\phi, \theta)$ are given in Section 8.15. As the notation indicates, $D(\phi, \theta)$ depends only on $\phi$ and $\theta$, and so the LS estimator $(\hat\phi, \hat\theta)$ has the same distribution-free property as in the AR case, described at the end of Section 8.2.2. The expression for $V_{LS,\mu}$ is

$$V_{LS,\mu} = \frac{\sigma_u^2}{\xi^2} \tag{8.62}$$

with

$$\xi = -\frac{1 - \phi_1 - \cdots - \phi_p}{1 - \theta_1 - \cdots - \theta_q}, \tag{8.63}$$

which depends upon the variance of the innovations $\sigma_u^2$ as well as on $\phi$ and $\theta$.

The conditional MLE of the innovations variance $\sigma_u^2$ for an ARMA($p$, $q$) model is given by

$$\hat\sigma_u^2 = \frac{1}{T - p} \sum_{t=p+1}^{T} \hat u_t^2(\hat\lambda). \tag{8.64}$$

The estimator $\hat\sigma_u^2$ is asymptotically uncorrelated with $\hat\lambda_{LS}$ and has the same asymptotic distribution as in the AR case, namely $\mathrm{AsVar}(\hat\sigma_u^2) = \mathrm{E}(u^4) - \sigma_u^4$.

Note that the asymptotic distribution of $(\hat\phi, \hat\theta)$ does not depend on the distribution of the innovations, and hence the precision of the estimators does not depend on their variance, as long as it is finite.

A natural estimator of the variance of the estimator $\hat\mu$ is obtained by plugging parameter estimates into (8.62).


8.4 M-estimators of ARMA models 


8.4.1 M-estimators and their asymptotic distribution 


An M-estimator $\hat\lambda_M$ of the parameter vector $\lambda$ for an ARMA($p$, $q$) model is obtained
by minimizing

$$\sum_{t=p+1}^{T} \rho\!\left(\frac{\hat u_t(\lambda)}{\hat\sigma}\right), \tag{8.65}$$

where $\rho$ is a $\rho$-function already used for regression in (5.7). The residuals $\hat u_t(\lambda)$
are defined as in the case of the LS estimator, and $\hat\sigma$ is a robust scale estimator
that is obtained either simultaneously with $\hat\lambda_M$ (e.g., as an M-scale of the $\hat u_t$'s as in
Section 2.7.2) or previously (as with MM-estimators in Section 5.5). We assume that
when $T \to \infty$, $\hat\sigma$ converges in probability to a value $\sigma$ that is a scale parameter of the
innovations. It is also assumed that $\sigma$ is standardized so that if the innovations are
normal, $\sigma$ coincides with the standard deviation $\sigma_u$ of $u_t$, as explained for the location
case at the end of Section 2.5.
Let $\psi = \rho'$ and assume that

$$E\,\psi\!\left(\frac{u_t}{\sigma}\right) = 0. \tag{8.66}$$


Note that this condition is analogous to (4.41) used in regression. Under the assumptions
concerning the ARMA process made in Section 8.3 and under reasonable regularity
conditions, which include that $\sigma_u^2 = \mathrm{Var}(u_t) < \infty$, the M-estimator $\hat\lambda_M$ has an
asymptotic normal distribution given by


$$\sqrt{T}\,(\hat\lambda_M - \lambda) \to_d N_{p+q+1}(0, V_M), \tag{8.67}$$

with

$$V_M = V_M(\phi, \theta, \sigma^2) = a\,V_{LS}, \tag{8.68}$$

where $a$ depends on the distribution $F$ of the $u_t$'s:

$$a = a(\psi, F) = \frac{\sigma^2\,E\,\psi(u_t/\sigma)^2}{\sigma_u^2\,\bigl(E\,\psi'(u_t/\sigma)\bigr)^2}. \tag{8.69}$$


A heuristic proof is given in Section 8.15. In the normal case, $\sigma = \sigma_u$ and $a$ coincides
with the reciprocal of the efficiency of a location or regression M-estimator
(see (4.45)).

In the case that $\rho(u) = -\log f(u)$, where $f$ is the density of the innovations, the
M-estimator $\hat\lambda_M$ is a conditional MLE, and in this case the M-estimator is asymptotically
efficient.
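
As a rough numerical illustration of the factor $a$ in (8.69), the sketch below evaluates it for standard normal innovations (so $\sigma = \sigma_u = 1$) with Huber's $\psi$. The choice of Huber's $\psi$, the tuning constant 1.345 and the function names are assumptions made only for this illustration, because the two expectations then have closed forms; they are not taken from the text.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def huber_a(k):
    """Factor a(psi, F) of (8.69) for Huber's psi with constant k and F = N(0, 1).

    Assumes sigma = sigma_u = 1, so a = E psi(u)^2 / (E psi'(u))^2.
    """
    inside = 2.0 * norm_cdf(k) - 1.0            # P(|u| <= k)
    # E psi(u)^2 = E[u^2; |u| <= k] + k^2 P(|u| > k)
    e_psi2 = inside - 2.0 * k * norm_pdf(k) + 2.0 * k * k * (1.0 - norm_cdf(k))
    e_dpsi = inside                             # E psi'(u) = P(|u| <= k)
    return e_psi2 / e_dpsi ** 2

a = huber_a(1.345)        # hypothetical tuning constant
efficiency = 1.0 / a      # normal efficiency relative to LS
print(round(a, 3), round(efficiency, 3))
```

With this constant the efficiency comes out close to 0.95, and letting `k` grow makes `a` approach 1, which corresponds to the LS case.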


8.4.2 The behavior of M-estimators in AR processes 
with additive outliers 


We already know from the discussion in Sections 8.1.1 and 8.1.3 that LS estimators 
of AR models are not robust in the presence of AOs or ROs. Such outliers cause both 
bias and inflated variability of LS estimators. 

The LS estimator (8.19) proceeds as an ordinary regression, where $y_t$ is regressed
on the "predictors" $y_{t-1},\dots,y_{t-p}$. Similarly, any robust regression estimator based on
the minimization of a function of the residuals can be applied to the AR model, in
particular the S-, M- and MM-estimators defined in Chapter 5. In order to obtain some
degree of robustness, it is necessary, just as in the treatment of ordinary regression in
that chapter, that $\rho$ be bounded, and in addition a suitable algorithm must be used to
help ensure a good local minimum.

This approach has the advantage that existing software for regression can
be readily used. However, it has the drawback that if the observations $y_t$ are
actually an AR($p$) process contaminated with an AO or RO, the robustness of the
estimators decreases with increasing $p$. The reason for this is that in the estimation
of AR($p$) parameters, the observation $y_t$ is used in computing the $p+1$ residuals
$\hat u_t(\gamma,\phi), \hat u_{t+1}(\gamma,\phi),\dots,\hat u_{t+p}(\gamma,\phi)$. Each time an outlier appears in the series it may
spoil $p+1$ residuals. In an informal way, we can say that the BP of an M-estimator
is not larger than $0.5/(p+1)$. Correspondingly, the bias due to an AO can be quite
high and one expects only a limited degree of robustness.
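
The smearing of one additive outlier over $p+1$ residuals can be checked directly. The sketch below is only an illustration: the series is an arbitrary deterministic sequence rather than a simulated AR process, and the function name `ar_residuals` is hypothetical. It compares AR(3) residuals before and after a single observation is shifted by 4.

```python
import math

def ar_residuals(y, phi, gamma=0.0):
    """Residuals u_t = y_t - gamma - sum_i phi_i * y_{t-i}, for t = p, ..., T-1."""
    p = len(phi)
    return [y[t] - gamma - sum(phi[i] * y[t - 1 - i] for i in range(p))
            for t in range(p, len(y))]

phi = [8 / 6, -5 / 6, 1 / 6]          # AR(3) coefficients of Example 8.2
p = len(phi)

clean = [math.sin(0.3 * t) for t in range(50)]   # arbitrary series, for illustration
contaminated = clean[:]
contaminated[20] += 4.0                          # one additive outlier at t = 20

r_clean = ar_residuals(clean, phi)
r_cont = ar_residuals(contaminated, phi)

# Time indices whose residual changed because of the single outlier.
spoiled = [i + p for i, (u0, u1) in enumerate(zip(r_clean, r_cont))
           if abs(u0 - u1) > 1e-9]
print(spoiled)                # the outlier time and the p following times
print(0.5 / (p + 1))          # heuristic BP bound for p = 3
```

One contaminated observation changes exactly $p+1 = 4$ residuals, which is the source of the heuristic bound $0.5/(p+1) = 0.125$.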


Example 8.2 Simulated AR(3) data with AO. Tables for this example are obtained 
with script ar3.R. 


To demonstrate the effect of contamination on these estimators, we generated
$T = 200$ observations $x_t$ from a stationary normal AR(3) model with $\sigma_u = 1$, $\gamma = 0$
and $\phi = (8/6, -5/6, 1/6)'$. We then modified $k$ evenly spaced observations by adding
four to each, for $k = 10$ and 20. Table 8.1 shows the results for LS and for the MM
regression estimator with bisquare function and efficiency 0.85 (script AR3).


It is seen that the LS estimator is much affected by 10 outliers. The MM-estimator 
is similar to the LS estimator when there are no outliers. It is less biased and so better 
than LS when there are 10 outliers, but it is highly affected by them when there are 20 
outliers. The reason is that in this case the proportion of outliers is 20/200 = 0.1, which 
is near the heuristic BP value of 0.125 = 0.5/(p + 1), as discussed in Section 8.4.2. 


Table 8.1 LS and MM-estimators of the parameters of an AR(3) simulated process

Estimator    #(outliers)   $\hat\phi_1$   $\hat\phi_2$   $\hat\phi_3$   $\hat\gamma$   $\hat\sigma_u$
LS            0            1.41    -0.92    0.21    0.00   0.99
             10            0.74    -0.09   -0.14    0.10   1.68
             20            0.58    -0.01   -0.16    0.24   2.02
MM            0            1.39    -0.93    0.23   -0.02   0.91
             10            1.12    -0.51    0.04   -0.03   1.16
             20            0.56     0.02   -0.04   -0.19   1.61
True values                1.333   -0.833   0.166   0.00   1.00


8.4.3 The behavior of LS and M-estimators for ARMA
processes with infinite innovation variance


The asymptotic behavior of the LS and M-estimators for ARMA processes has been
discussed in the previous sections under the assumption that the innovations $u_t$ have
finite variance. When this is not true, it may be surprising to know that under certain
conditions the LS estimator not only is still consistent, but also converges to the true
value at a faster rate than it would under finite innovation variance, with the consistency
rate depending on the rate at which $P(|u_t| > k)$ tends to zero as $k \to \infty$.

For the case of an M-estimator with bounded $\psi$, and assuming that a good robust
scale estimator $\hat\sigma$ is used, a heavy-tailed $f$ can lead to ultra-precise estimation of the
ARMA parameters $(\phi, \theta)$ (but not of $\mu$), in the sense that $\sqrt{T}(\hat\phi - \phi) \to_p 0$ and
$\sqrt{T}(\hat\theta - \theta) \to_p 0$. This fact can be understood by noting that if $u_t$ has a heavy-tailed
distribution, such as the Cauchy distribution, then the expectations in (8.69) and $\sigma$ are finite,
while $\sigma_u$ is infinite, so that the factor $a$ in (8.69) is zero.

To make this clear, consider fitting an AR(1) model. Estimating $\phi$ is equivalent
to fitting a straight line to the lag-1 scatterplot of $y_t$ against $y_{t-1}$. Each IO appears
twice in the scatterplot: as $y_{t-1}$ and as $y_t$. In the first case it is a "good" leverage point,
and in the second it is an outlier. Both LS and M-estimators take advantage of the
leverage point. But the LS estimator is affected by the outlier, while the M-estimator
is not.

The main LS results were derived by Kanter and Steiger (1974), Yohai and
Maronna (1977), Knight (1987) and Hannan and Kanter (1977) for AR processes,
and by Mikosch et al. (1995), Davis (1996) and Rachev and Mittnik (2000) in the
ARMA case. Results for monotone M-estimators were obtained by Davis et al.
(1992) and Davis (1996).

The challenges of establishing results in time series with infinite-variance innovations
have been of great interest to academics and have resulted in many papers,
particularly in the econometrics and finance literature. See, for example, applications
to unit root tests (Samarakoon and Knight, 2005, and references therein) and applications
to GARCH models (Rachev and Mittnik, 2000), one of the most interesting
of which is the application to option pricing by Menn and Rachev (2005).


8.5 Generalized M-estimators 


One approach to curb the effect of "bad leverage points" due to outliers is to modify
M-estimators in a way similar to ordinary regression. Note first that the estimating
equation of an M-estimator, obtained by differentiating the objective function with
respect to $(\gamma, \phi)$, is

$$\sum_{t=p+1}^{T} \mathbf{z}_{t-1}\,\psi\!\left(\frac{\hat u_t(\gamma,\phi)}{\hat\sigma}\right) = 0 \tag{8.70}$$

where $\psi = \rho'$ is bounded and $\mathbf{z}_t = (1, y_t, y_{t-1},\dots,y_{t-p+1})'$.

One way to improve the robustness of the estimators is to modify (8.70) by bounding
the influence of outliers in $\mathbf{z}_{t-1}$ as well as in the residuals $\hat u_t(\gamma, \phi)$. This results in
the class of generalized M-estimators (GM-estimators), similar to those defined for
regression in Section 5.11.1. A GM-estimator $(\hat\gamma, \hat\phi)$ is obtained by solving


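$$\sum_{t=p+1}^{T} \eta\!\left(d_T(\mathbf{y}_{t-1}),\ \frac{\hat u_t(\hat\gamma,\hat\phi)}{\hat\sigma}\right)\mathbf{z}_{t-1} = 0 \tag{8.71}$$

where the function $\eta(\cdot,\cdot)$ is bounded and continuous in both arguments (say, of
Mallows or Schweppe type, as defined in Section 5.11.1) and $\hat\sigma$ is obtained from a
simultaneous M-equation of the form

$$\frac{1}{T-p}\sum_{t=p+1}^{T} \rho\!\left(\frac{\hat u_t(\hat\gamma,\hat\phi)}{\hat\sigma}\right) = \delta. \tag{8.72}$$

Here

$$d_T(\mathbf{y}_{t-1})^2 = \frac{1}{p}\,\mathbf{y}_{t-1}'\,\hat{C}^{-1}\,\mathbf{y}_{t-1} \tag{8.73}$$

with $\hat C$ an estimator of the $p \times p$ covariance matrix $C$ of $\mathbf{y}_{t-1} = (y_{t-1}, y_{t-2},\dots,y_{t-p})'$.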

In the remark above (8.50), it was pointed out that the asymptotic distribution of 
LS estimators for AR models coincides with that of LS estimators for the regression 
model (8.21). The same can be said of GM-estimators. 

GM-estimators for AR models were introduced by Denby and Martin (1979) and 
Martin (1980, 1981), who called them bounded influence autoregressive (BIFAR) 
estimators. Bustos (1982) showed that GM-estimators for AR(p) models are asymp- 
totically normal, with covariance matrix given by the analog of the regression case 
(5.90). Künsch (1984) derived Hampel-optimal GM-estimators.

There are two main possibilities for $\hat C$. The first is to use the representation
$C = \sigma_u^2 D(\phi)$ given by the matrix $D$ in Section 8.15, where $\phi$ is the parameter vector
of the $p$th order autoregression, and put $\hat C = \hat\sigma_u^2 D(\hat\phi)$ in (8.73). Then $\hat\phi$ appears
twice in (8.71): in $d_T$ and in $\hat u_t$. This is a natural approach when fitting a single
autoregression of given order $p$.

The second possibility is convenient in the commonly occurring situation where
one fits a sequence of autoregressions of increasing order, with a view toward determining
a "best" order $p_{opt}$.

Let $\phi_{k,1},\dots,\phi_{k,k}$ be the coefficients of the "best-fitting" autoregression of order
$k$, given in Section 8.2.1. The autocorrelations $\rho(1),\dots,\rho(p-1)$ depend only on
$\phi_{p-1,1},\dots,\phi_{p-1,p-1}$ and can be obtained from the Yule-Walker equations by solving
a linear system. Therefore the correlation matrix $R$ of $\mathbf{y}_{t-1}$ also depends only on
$\phi_{p-1,1},\dots,\phi_{p-1,p-1}$. We also have that

$$C = C(0)\,R.$$

Then we can estimate $\phi_{p,1},\dots,\phi_{p,p}$ recursively as follows. Suppose that we have
already computed estimators $\hat\phi_{p-1,1},\dots,\hat\phi_{p-1,p-1}$. Then, we estimate $\phi_{p,1},\dots,\phi_{p,p}$
by solving (8.71) and (8.72) with $\hat C = \hat C(0)\hat R$, where $\hat C(0)$ is a robust estimator of
the variance of the $y_t$'s (say, the square of the MADN) and $\hat R$ is computed from the
Yule-Walker equations using $\hat\phi_{p-1,1},\dots,\hat\phi_{p-1,p-1}$.

Table 8.2 shows the results of applying a Mallows-type GM-estimator to the data
of Example 8.2 (script AR3). It is seen that the performance of the GM-estimator is
no better than that of the MM-estimator shown in Table 8.1.

Table 8.2 GM-estimators of the parameters of AR(3) simulated process

#(outliers)    $\hat\phi_1$   $\hat\phi_2$   $\hat\phi_3$   $\hat\gamma$   $\hat\sigma_u$
 0             1.31    -0.79    0.10    0.11   0.97
10             1.15    -0.52   -0.03    0.17   1.06
20             0.74    -0.16   -0.09    0.27   1.46
True values    1.333   -0.833   0.166   0.00   1.00


8.6 Robust AR estimation using robust filters 


In this section we assume that the observation process $y_t$ has the AO form
$y_t = x_t + v_t$, with $x_t$ an AR($p$) process as given in (8.14), with parameters
$\lambda = (\phi_1, \phi_2,\dots,\phi_p, \gamma)'$, and with $v_t$ independent of $x_t$. An attractive approach is to define
robust estimators by minimizing a robust scale of the prediction residuals, as with 
regression S-estimators in Section 5.4.1. It turns out that this approach is not 
sufficiently robust. A more robust method is obtained by minimizing a robust scale 
of prediction residuals obtained with a robust filter that curbs the effect of outliers. 
We begin by explaining why the simple approach of minimizing a robust scale of the 
prediction residuals is not adequate. Most of the remainder of this section is devoted 
to describing the robust filtering method, the scale minimization approach using 
prediction residuals from a robust filter, and the computational details for the whole 
procedure. The section concludes with an application example and an extension of 
the method to integrated AR(p) models. 


8.6.1 Naive minimum robust scale autoregression estimators 


In this section we deal with the robust estimation of the AR parameters by minimizing
a robust scale estimator $\hat\sigma$ of prediction residuals. Let $y_t$, $1 \le t \le T$, be observations
corresponding to an AO model $y_t = x_t + v_t$, where $x_t$ is an AR($p$) process. For any
$\lambda = (\phi_1, \phi_2,\dots,\phi_p, \mu)' \in R^{p+1}$, define the residual vector as

$$\hat{\mathbf{u}}(\lambda) = (\hat u_{p+1}(\lambda),\dots,\hat u_T(\lambda))',$$

where

$$\hat u_t(\lambda) = (y_t - \mu) - \phi_1(y_{t-1} - \mu) - \cdots - \phi_p(y_{t-p} - \mu). \tag{8.74}$$

Given a scale estimator $\hat\sigma$, an estimator of $\lambda$ can be defined by

$$\hat\lambda = \arg\min_{\lambda} \hat\sigma(\hat{\mathbf{u}}(\lambda)). \tag{8.75}$$


If $\hat\sigma$ is a high-BP M-estimator of scale we would have the AR analog of regression
S-estimators. Boente et al. (1987) generalized the notion of qualitative robustness
(Section 3.7) to time series, and proved that S-estimators for autoregression are qualitatively
robust and have the same efficiency as regression S-estimators. As happens
in the regression case (see (5.24)), estimators based on the minimization of an M-scale
are M-estimators, where the scale is the minimum scale, and therefore all the asymptotic
theory of M-estimators applies under suitable regularity conditions.

If $\hat\sigma$ is a $\tau$-estimator of scale (Section 5.4.3), it can be shown that, as in the
regression case, the resulting AR estimators have a higher normal efficiency than
that corresponding to an M-scale.

For the reasons given in Section 8.4.2, any estimator based on the prediction resid- 
uals has a BP not larger than 0.5/(p + 1) for AR(p) models. Since invertible MA and 
ARMA models have infinite AR representations, the BP of estimators based on the 
prediction residuals will be zero for such models. 

Section 8.6.2 shows how to obtain an improved S-estimator through the use of 
robust filtering. 


8.6.2 The robust filter algorithm 


Let $y_t$ be an AO process (8.5), where $x_t$ is a stationary AR($p$) process with mean
0, and $\{v_t\}$ are i.i.d. and independent of $\{x_t\}$ with distribution (8.9). To avoid the
propagation of outliers to many residuals, as described above, we shall replace the
prediction residuals $\hat u_t(\lambda)$ in (8.74) by the robust prediction residuals

$$\tilde u_t(\lambda) = (y_t - \mu) - \phi_1(\hat x_{t-1|t-1} - \mu) - \cdots - \phi_p(\hat x_{t-p|t-1} - \mu) \tag{8.76}$$

obtained by replacing the AO observations $y_{t-i}$, $i = 1,\dots,p$, in (8.74) by the robust filtered
values $\hat x_{t-i|t-1} = \hat x_{t-i|t-1}(\lambda)$, $i = 1,\dots,p$, which are approximations to the values
$E(x_{t-i} \mid y_1,\dots,y_{t-1})$.

These approximated conditional expectations were derived by Masreliez (1975)
and are obtained by means of a robust filter. To describe this filter we need the
so-called state-space representation of the $x_t$'s (see, for example, Brockwell and
Davis, 1991), which for an AR($p$) model is

$$\mathbf{x}_t = \boldsymbol{\mu} + \Phi(\mathbf{x}_{t-1} - \boldsymbol{\mu}) + \mathbf{d}\,u_t \tag{8.77}$$

where $\mathbf{x}_t = (x_t, x_{t-1},\dots,x_{t-p+1})'$ is called the state vector, $\mathbf{d}$ and $\boldsymbol{\mu}$ are defined by

$$\mathbf{d} = (1, 0,\dots,0)', \quad \boldsymbol{\mu} = (\mu,\dots,\mu)', \tag{8.78}$$

and $\Phi$ is the state-transition matrix given by

$$\Phi = \begin{bmatrix} \phi_1 \ \cdots \ \phi_{p-1} & \phi_p \\ I_{p-1} & 0_{p-1} \end{bmatrix}. \tag{8.79}$$

Here $I_k$ is the $k \times k$ identity matrix and $0_k$ the zero vector in $R^k$.
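
The state-space form (8.77)-(8.79) can be sketched in a few lines. The code below is an illustrative sketch only: the names `companion` and `state_step` and the numerical values are hypothetical. It builds the state-transition matrix for given AR coefficients and advances the state by one step, so that the first component follows the AR($p$) recursion and the remaining components are shifted.

```python
def companion(phi):
    """State-transition matrix (8.79): first row phi, identity block below."""
    p = len(phi)
    rows = [list(phi)]
    for i in range(p - 1):
        rows.append([1.0 if j == i else 0.0 for j in range(p)])
    return rows

def state_step(state, phi, mu, u):
    """One step of (8.77): x_t = mu + Phi (x_{t-1} - mu) + d u_t, d = (1, 0, ..., 0)'."""
    p = len(phi)
    Phi = companion(phi)
    centered = [x - mu for x in state]
    new_state = [mu + sum(Phi[i][j] * centered[j] for j in range(p))
                 for i in range(p)]
    new_state[0] += u          # the innovation enters the first component only
    return new_state

# State (x_{t-1}, x_{t-2}, x_{t-3}) of an AR(3) process with mean 1.
state = [2.0, 3.0, 4.0]
new_state = state_step(state, phi=[0.5, 0.2, 0.1], mu=1.0, u=0.5)
print(new_state)   # first entry is x_t; the others are x_{t-1}, x_{t-2}
```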

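The following recursions compute robust filtered vectors $\hat{\mathbf{x}}_{t|t}$, which are approximations
of $E(\mathbf{x}_t \mid y_1, y_2,\dots,y_t)$, and robust one-step-ahead predictions $\hat{\mathbf{x}}_{t|t-1}$, which
are approximations of $E(\mathbf{x}_t \mid y_1, y_2,\dots,y_{t-1})$. At each time $t-1$, the robust prediction
vectors $\hat{\mathbf{x}}_{t|t-1}$ are computed from the robustly filtered vectors $\hat{\mathbf{x}}_{t-1|t-1}$ as

$$\hat{\mathbf{x}}_{t|t-1} = \boldsymbol{\mu} + \Phi(\hat{\mathbf{x}}_{t-1|t-1} - \boldsymbol{\mu}). \tag{8.80}$$

Then the prediction vector $\hat{\mathbf{x}}_{t|t-1}$ and the AO observation $y_t$ are used to compute
the residual $\tilde u_t(\lambda)$ and $\hat{\mathbf{x}}_{t|t}$ using the recursions

$$\tilde u_t(\lambda) = (y_t - \mu) - \boldsymbol{\phi}'(\hat{\mathbf{x}}_{t-1|t-1} - \boldsymbol{\mu}) \tag{8.81}$$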


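and

$$\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \frac{1}{s_t}\,\mathbf{m}_t\,\psi\!\left(\frac{\tilde u_t(\lambda)}{s_t}\right) \tag{8.82}$$

where $s_t$ is an estimator of the scale of the prediction residual $\tilde u_t$ and $\mathbf{m}_t$ is a vector.
Recursions for $s_t$ and $\mathbf{m}_t$ are provided in Section 8.16. Here $\psi$ is a bounded $\psi$-function
that for some constants $a < b$ satisfies

$$\psi(u) = \begin{cases} u & \text{if } |u| \le a \\ 0 & \text{if } |u| > b. \end{cases} \tag{8.83}$$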

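It turns out that the first element of $\mathbf{m}_t$ is $s_t^2$, and hence the first coordinate
of the vector recursion (8.82) gives the scalar version of the filter. Hence if
$\hat{\mathbf{x}}_{t|t} = (\hat x_{t|t},\dots,\hat x_{t-p+1|t})'$ and $\hat{\mathbf{x}}_{t|t-1} = (\hat x_{t|t-1},\dots,\hat x_{t-p+1|t-1})'$, we have

$$\hat x_{t|t} = \hat x_{t|t-1} + s_t\,\psi\!\left(\frac{\tilde u_t(\lambda)}{s_t}\right). \tag{8.84}$$

It follows that

$$\hat x_{t|t} = \hat x_{t|t-1} \quad \text{if } |\tilde u_t| > b\,s_t \tag{8.85}$$

and

$$\hat x_{t|t} = y_t \quad \text{if } |\tilde u_t| \le a\,s_t. \tag{8.86}$$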


Equation (8.85) shows that the robust filter rejects observations with scaled absolute
robust prediction residuals $|\tilde u_t/s_t| > b$, and replaces them with predicted values
based on previously filtered data. Equation (8.86) shows that observations with $|\tilde u_t/s_t|
\le a$ remain unaltered. Observations for which $|\tilde u_t/s_t| \in (a, b)$ are modified depending
on how close the values are to $a$ or $b$. Consequently, the action of the robust filter
is to "clean" the data of outliers by replacing them with predictions (one-sided interpolates)
while leaving most of the remaining data unaltered. As such, the robust filter
might well be called an "outlier-cleaner".
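
The cleaning action described by (8.84)-(8.86) can be illustrated with a scalar AR(1) version of the filter. This is a simplified sketch under explicit assumptions, not the full algorithm: the scale $s_t$ is held fixed at a constant `s` instead of being updated by the recursions of Section 8.16, the AR coefficient is taken as known, the first observation is assumed clean, and the $\psi$-function is a hypothetical piecewise-linear taper that satisfies (8.83) with $a = 2$ and $b = 3$.

```python
import math

def psi(u, a=2.0, b=3.0):
    """Identity on [-a, a], zero beyond b, linear taper in between (an assumption)."""
    au = abs(u)
    if au <= a:
        return u
    if au >= b:
        return 0.0
    return math.copysign(a * (b - au) / (b - a), u)

def robust_filter_ar1(y, phi, mu=0.0, s=1.0):
    """Scalar robust filter for AR(1) with a fixed residual scale s (simplification)."""
    filtered = [y[0]]                 # assumes the first observation is not an outlier
    residuals = []
    for t in range(1, len(y)):
        pred = mu + phi * (filtered[-1] - mu)      # one-step prediction, cf. (8.80)
        u = y[t] - pred                            # robust prediction residual, cf. (8.81)
        residuals.append(u)
        filtered.append(pred + s * psi(u / s))     # filter update, cf. (8.84)
    return filtered, residuals

# Deterministic AR(1)-type series with small "innovations", plus one additive outlier.
x = [0.0]
for t in range(1, 30):
    x.append(0.5 * x[-1] + 0.3 * math.sin(t))
y = x[:]
y[10] += 10.0

filtered, residuals = robust_filter_ar1(y, phi=0.5)
print(y[10], filtered[10], x[10])   # the outlier is replaced by its prediction
```

In this example the outlying observation is replaced by the one-step prediction, all other observations pass through unchanged, and only the residual at the outlier time is large.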

The above robust filter recursions have the same general form as the class of
approximate conditional mean robust filters introduced by Masreliez (1975). See also
Masreliez and Martin (1977), Kleiner et al. (1979), Martin and Thomson (1982), Martin
et al. (1983), Martin and Yohai (1985), Brandt and Künsch (1988) and Meinhold
and Singpurwalla (1989). In order that the filter $\hat x_{t|t}$ be robust in a well-defined sense,
it is sufficient that the functions $\psi$ and $\psi(u)/u$ be bounded and continuous (Martin
and Su, 1985).

The robust filtering algorithm, which we have just described for the case of a true
AR($p$) process $x_t$, can also be used for data cleaning and prediction based on cleaned
data for a memory-$l$ predictor, $1 \le l \le p$. Such use of the robust filter algorithm is
central to the robustified Durbin-Levinson algorithm that we will describe shortly.


Remark 1: Note that the filter as described modifies all observations that are far
enough from their predicted values, including innovations outliers. But this may
damage the output of the filter, since altering one IO spoils the prediction of the
ensuing values. The following modification of the above procedure deals with this
problem. When a sufficiently large number of consecutive observations have been
corrected (that is, $\hat x_{t|t} \ne y_t$ for $t = t_0,\dots,t_0 + h$), the procedure goes back to $t_0$,
redefines $\hat x_{t_0|t_0} = y_{t_0}$, and then goes on with the recursions.

Remark 2: Note that the robust filter algorithm replaces large outliers with pre- 
dicted values based on the past, and as such produces “one-sided” interpolated 
values. One can improve the quality of the outlier treatment by using a two-sided 
interpolation at outlier positions by means of a robust smoother algorithm. One 
such algorithm is described by Martin (1979), who derives the robust smoother as 
an approximate conditional mean smoother analogous to Masreliez’s approximate 
conditional mean filter. 


8.6.3. Minimum robust scale estimators based on robust filtering 


If $\hat\sigma$ is a robust scale estimator, an estimator based on robust filtering may be
defined as

$$\hat\lambda = \arg\min_{\lambda} \hat\sigma(\tilde{\mathbf{u}}(\lambda)) \tag{8.87}$$

where $\tilde{\mathbf{u}}(\lambda) = (\tilde u_{p+1}(\lambda),\dots,\tilde u_T(\lambda))'$ is the vector of robust prediction residuals $\tilde u_t$
given by (8.76). The use of these residuals in place of the raw prediction residuals
(8.74) prevents the smearing effect of isolated outliers, and therefore will result in an
estimator that is more robust than M-estimators or estimators based on a scale of the
raw residuals $\hat u_t$.
One problem with this approach is that the objective function $\hat\sigma(\tilde{\mathbf{u}}(\lambda))$ in (8.87)
typically has multiple local minima, making it difficult to find a global minimum.




Fortunately, there is a computational approach based on a different parameteriza- 
tion in which the optimization is performed one parameter at a time. This procedure 
amounts to a robustified Durbin—Levinson algorithm, as described in Section 8.6.4. 


8.6.4 A robust Durbin—Levinson algorithm 


There are two reasons why the Durbin—Levinson procedure is not robust: 


• The quadratic loss function in (8.35) is unbounded.
• The residuals $\hat u_{t,m}(\phi)$ defined in (8.45) are subject to an outlier "smearing" effect:
  if $y_t$ is an isolated outlier, it spoils the $m+1$ residuals $\hat u_{t,m}(\phi), \hat u_{t+1,m}(\phi),\dots,\hat u_{t+m,m}(\phi)$.


We now describe a modification of the standard sample-based Durbin—Levinson 
method that eliminates the preceding two sources of nonrobustness. The observations 
y, are assumed to have been previously robustly centered by the subtraction of the 
median or another robust location estimator. 

A robust version of (8.31) will be obtained in a recursive way, analogous to the 
classical Durbin—Levinson algorithm, as follows. 

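Let $\hat\phi_{m-1,1},\dots,\hat\phi_{m-1,m-1}$ be robust estimators of the coefficients $\phi_{m-1,1},\dots,
\phi_{m-1,m-1}$ of the memory-$(m-1)$ linear predictor. If we knew that $\phi_{m,m} = \zeta$, then
according to (8.38), we could estimate the memory-$m$ predictor coefficients as

$$\hat\phi_{m,i}(\zeta) = \hat\phi_{m-1,i} - \zeta\,\hat\phi_{m-1,m-i}, \quad i = 1,\dots,m-1. \tag{8.88}$$

Therefore it would only remain to estimate $\zeta$.
The robust memory-$m$ prediction residuals $\tilde u_{t,m}(\zeta)$ may be written in the form

$$\tilde u_{t,m}(\zeta) = y_t - \hat\phi_{m,1}(\zeta)\,\hat x^{(m)}_{t-1|t-1}(\zeta) - \cdots - \hat\phi_{m,m-1}(\zeta)\,\hat x^{(m)}_{t-m+1|t-1}(\zeta) - \zeta\,\hat x^{(m)}_{t-m|t-1}(\zeta), \tag{8.89}$$

where $\hat x^{(m)}_{t-i|t-1}(\zeta)$, $i = 1,\dots,m$, are the components of the robust state vector estimator
$\hat{\mathbf{x}}^{(m)}_{t-1|t-1}$ obtained using the robust filter (8.82), corresponding to an order-$m$
autoregression with parameters

$$(\hat\phi_{m,1}(\zeta), \hat\phi_{m,2}(\zeta),\dots,\hat\phi_{m,m-1}(\zeta), \zeta). \tag{8.90}$$

Observe that $\tilde u_{t,m}(\zeta)$ is defined as $\hat u_{t,m}(\zeta)$ in (8.45), except for the replacement
of $y_{t-1},\dots,y_{t-m}$ with the robustly filtered values $\hat x^{(m)}_{t-1|t-1}(\zeta),\dots,\hat x^{(m)}_{t-m|t-1}(\zeta)$. Now an
outlier $y_t$ may spoil only a single residual $\tilde u_{t,m}(\zeta)$, rather than $p+1$ residuals, as in
the case of the usual AR($p$) residuals in (8.74).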

The standard Durbin-Levinson algorithm computes $\hat\phi_{m,m}$ by minimizing the sum
of squares (8.44), which in the present context is equivalent to minimizing the sample
standard deviation of the $\tilde u_{t,m}(\zeta)$'s defined by (8.89). Since the $\tilde u_{t,m}(\zeta)$'s may have
outliers in the $y_t$ term and the sample standard deviation is not robust, we replace
it with a highly robust scale estimator $\hat\sigma = \hat\sigma(\tilde u_{m+1,m}(\zeta),\dots,\tilde u_{T,m}(\zeta))$. We have thus
eliminated the two sources of non-robustness of the standard Durbin-Levinson
algorithm. Finally, the robust partial autocorrelation coefficient estimators $\hat\phi_{m,m}$,
$m = 1, 2,\dots,p$, are obtained sequentially by solving

$$\hat\phi_{m,m} = \arg\min_{\zeta} \hat\sigma(\tilde u_{m+1,m}(\zeta),\dots,\tilde u_{T,m}(\zeta)), \tag{8.91}$$

where for each $m$ the values $\hat\phi_{m,i}$, $i = 1,\dots,m-1$, are obtained from (8.38). This
minimization can be performed by a grid search on $(-1, 1)$.

The first step of the procedure is to compute a robust estimator $\hat\phi_{1,1}$ of $\phi_{1,1}$ by
means of a robust version of (8.43), namely (8.91) with $m = 1$, where

$$\tilde u_{t,1}(\zeta) = y_t - \zeta\,\hat x^{(1)}_{t-1|t-1}(\zeta).$$


8.6.5 Choice of scale for the robust Durbin—Levinson procedure 


One possibility for the choice of a robust scale in (8.91) is to use an M-scale with a
BP of 0.5, in which case the resulting estimator is an S-estimator of autoregression
using robustly filtered values. However, it was pointed out in Section 5.4.1 that Hössjer
(1992) proved that an S-estimator of regression with a BP of 0.5 cannot have a
large-sample efficiency greater than 0.33 when the errors have a normal distribution.
This fact provided the motivation for using $\tau$-estimators of regression, as defined in
equations (5.26)-(5.28) of Section 5.4.3. These estimators can give high efficiency,
say 95%, when the errors have a normal distribution, while at the same time having
a high BP of 0.5. The relative performance of a $\tau$-estimator versus an S-estimator
with regard to BP and normal efficiency is expected to carry over to a large extent
to the present case of robust AR model fitting using robustly filtered observations.
Thus we recommend that the robust scale estimator $\hat\sigma$ in (8.91) be a $\tau$-scale defined
as in (5.26)-(5.28), but with residuals given by (8.88) and (8.89). The examples we
show for robust fitting of AR, ARMA, ARIMA and REGARIMA models in the
remainder of this chapter are all computed with an algorithm that uses a $\tau$-scale
applied to robustly filtered residuals. We shall call such estimators filtered $\tau$- (or F$\tau$-)
estimators. These estimators were studied by Bianco et al. (1996).

Table 8.3 shows the results of applying a filtered $\tau$ (F$\tau$) estimator to the data
of Example 8.2. It is seen that the impact of outliers is slight, and comparison with
Tables 8.1 and 8.2 shows the performance of the F$\tau$-estimator to be superior to that
of the MM- and GM-estimators.

Table 8.3 F$\tau$-estimators of the parameters of AR(3) simulated process

#(outliers)    $\hat\phi_1$   $\hat\phi_2$   $\hat\phi_3$   $\hat\gamma$   $\hat\sigma_u$
 0             1.37    -0.89    0.22   -0.01   0.87
10             1.43    -0.92    0.19    0.01   0.97
20             1.37    -0.89    0.15    0.00   1.00
True values    1.333   -0.833   0.166   0.00   1.00


8.6.6 Robust identification of AR order


The classical approach based on Akaike's information criterion (AIC; Akaike,
1973, 1974a), when applied to the choice of the order of AR models, leads to
the minimization of

$$\mathrm{AIC}_p = \log\!\left(\frac{1}{T-p}\sum_{t=p+1}^{T}\hat u_t^2(\hat\lambda_{p,LS})\right) + \frac{2p}{T-p}$$

where $\hat\lambda_{p,LS}$ is the LS estimator corresponding to an AR($p$) model, and $\hat u_t$ are the
respective residuals. The robust implementation of this criterion that is used here is
based on the minimization of

$$\mathrm{RAIC}_p = \log\!\left(\tau^2\bigl(\tilde u_{p+1}(\hat\lambda_{p,rob}),\dots,\tilde u_T(\hat\lambda_{p,rob})\bigr)\right) + \frac{2p}{T-p}$$

where $\tilde u_i(\hat\lambda_{p,rob})$ are the filtered residuals corresponding to the F$\tau$-estimator and $\tau$ is
the respective scale.

As with the RFPE criterion in (5.39), we believe that it would be better to multiply 
the penalty term 2p/(T — p) by a factor depending on the choice of the scale and on 
the distribution of the residuals. This area requires further research. 
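
The two order-selection criteria differ only in the scale applied to the residuals. The sketch below computes both from a given vector of residuals. It is an illustration under stated assumptions: the residuals are taken as already available (the LS fit or the robust filter is not implemented here), the function names are hypothetical, and the normalized median absolute deviation (MADN) is used as a simple stand-in for the $\tau$-scale.

```python
import math
import statistics

def aic(residuals, p):
    """AIC_p from the T - p LS residuals of an AR(p) fit."""
    n = len(residuals)                     # n = T - p
    return math.log(sum(u * u for u in residuals) / n) + 2.0 * p / n

def madn(values):
    """Normalized median absolute deviation (stand-in for the tau-scale)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values]) / 0.6745

def raic(filtered_residuals, p, scale=madn):
    """RAIC_p: log of a squared robust scale of filtered residuals plus the penalty."""
    n = len(filtered_residuals)            # n = T - p
    return math.log(scale(filtered_residuals) ** 2) + 2.0 * p / n

print(aic([1.0, -1.0, 1.0, -1.0], p=1))            # log(1) + 2/4
print(raic([-2.0, -1.0, 0.0, 1.0, 2.0], p=1))
```

In practice one would evaluate the criterion for each candidate order and keep the order with the smallest value.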


8.7. Robust model identification 


Time series autocorrelations are often computed for exploratory purposes without 
reference to a parametric model. In addition, autocorrelations are generally com- 
puted along with partial autocorrelations for use in identification of ARMA and 
ARIMA models; see for example Brockwell and Davis (1991). We know already 
from Sections 8.1.1 and 8.1.3 that additive outliers can have considerable influence 
and cause bias and inflated variability in the case of a lag-1 correlation estimator, 
and Section 8.2 indicates that additive outliers can have a similar adverse impact 
on partial autocorrelation estimators. Thus there is a need for robust estimators of 
autocorrelations and partial autocorrelations in the presence of AOs or IOs. 

While methods for robust estimation of autocorrelations and partial autocorrelations
have been discussed in the literature (see, for example, Ma and Genton, 2000),
our recommendation is to use one of the following two procedures, based on robust
fitting of an AR model of order $p^*$ using the robust F$\tau$-estimator, where
$p^*$ is selected using the robust AIC criterion described in Section 8.6.6:


• Procedure A: Compute classical autocorrelations and partial autocorrelations
  based on the robustly filtered values $\hat x_{t|t}$ for the AR($p^*$) model.

• Procedure B: Compute the theoretical autocorrelations and partial autocorrelations
  corresponding to the fitted model AR($p^*$). This can be done using the Yule-Walker
  equations (8.29)

  $$\rho(k) = \sum_{i=1}^{p^*} \phi_i\,\rho(k-i) \quad (k \ge 1) \tag{8.92}$$

  for the values of the unknown $\rho(k)$, where the unknown $\phi_i$ are replaced by the
  estimators $\hat\phi_i$. Note that the first $p^* - 1$ equations of the above set suffice to determine
  $\rho(k)$, $k = 1,\dots,p^* - 1$, and that $\rho(k)$ for $k \ge p^*$ are obtained recursively from
  (8.92). Once the autocorrelations have been computed, the partial autocorrelations
  can be computed using the Durbin-Levinson algorithm to fit AR($p$)
  models with $1 \le p \le p^*$.
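
Procedure B can be sketched as follows. The code is an illustrative sketch with hypothetical function names and a small hand-written linear solver: it solves the first $p^* - 1$ Yule-Walker equations for $\rho(1),\dots,\rho(p^*-1)$, extends the autocorrelations recursively with (8.92), and then obtains the partial autocorrelations with the classical Durbin-Levinson recursion. The coefficients passed in stand for the fitted $\hat\phi_i$.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]

def ar_acf(phi, nlags):
    """Theoretical autocorrelations rho(0..nlags) of a stationary AR(p), via (8.92)."""
    p = len(phi)
    n = p - 1
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for k in range(1, p):                  # equations for k = 1, ..., p - 1
        A[k - 1][k - 1] += 1.0
        for i in range(1, p + 1):
            j = abs(k - i)                 # rho(k - i) = rho(|k - i|)
            if j == 0:
                b[k - 1] += phi[i - 1]     # rho(0) = 1 moves to the right-hand side
            else:
                A[k - 1][j - 1] -= phi[i - 1]
    rho = [1.0] + (solve(A, b) if n > 0 else [])
    for k in range(len(rho), nlags + 1):   # recursive extension for k >= p
        rho.append(sum(phi[i - 1] * rho[k - i] for i in range(1, p + 1)))
    return rho[:nlags + 1]

def pacf_from_acf(rho):
    """Partial autocorrelations phi_{m,m}, m = 1, 2, ..., by Durbin-Levinson."""
    pacf, prev = [], []
    for m in range(1, len(rho)):
        num = rho[m] - sum(prev[i] * rho[m - 1 - i] for i in range(m - 1))
        den = 1.0 - sum(prev[i] * rho[i + 1] for i in range(m - 1))
        zeta = num / den
        prev = [prev[i] - zeta * prev[m - 2 - i] for i in range(m - 1)] + [zeta]
        pacf.append(zeta)
    return pacf

rho = ar_acf([4 / 3, -5 / 6], nlags=5)     # AR(2) coefficients of Example 8.3
pacf = pacf_from_acf(rho)
print([round(r, 4) for r in rho])
print([round(v, 4) for v in pacf])
```

For an AR(2) model the partial autocorrelations vanish beyond lag 2, and the lag-2 value equals the second AR coefficient, which is the pattern used for identification.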


Example 8.3 Identification of simulated AR(2) AO model. Tables and figures for 
this example are obtained with script identAR2.R. 


Consider an AO model $y_t = x_t + v_t$, where $x_t$ is a zero-mean Gaussian AR(2) model
with parameters $\phi = (\phi_1, \phi_2)' = (4/3, -5/6)'$ and innovations variance $\sigma_u^2 = 1$; and
$v_t = z_t w_t$, where $z_t$ and $w_t$ are independent, $P(z_t = \pm 1) = 0.5$, and $w_t$ has the mixture
distribution (8.6) with $\varepsilon = 0.1$, $\sigma_v = 1$ and $\mu_v = 4$. Figure 8.7 shows two series $x_t$ and $y_t$ of length
200 generated in this way.


Figure 8.7 Top: Gaussian AR(2) series; bottom: Gaussian AR(2) series with
additive outliers (horizontal axis: Time)

Figure 8.8 Estimated autocorrelations and partial autocorrelations for the AR(2)
model


The first row of Figure 8.8 displays the autocorrelation function (ACF) and partial 
autocorrelation function (PCF) of x,, the second those of y, and the third and fourth 
rows the robust ACF and PCF corresponding to Procedures A and B respectively. We 
observe that while the ACF and PCF of x, allow the correct identification of the AR(2) 
model, those of y, do not. The robust ACF and PCF obtained according to the two 
procedures are similar to those of x, and they also lead to correct model identification. 


Example 8.4 Identification of a simulated MA(1) AO model. Tables and figures for
this example are obtained with script identMA1.R.


Let us consider now an AO model $y_t = x_t + v_t$, where $x_t$ is a Gaussian MA(1) process
$x_t = u_t - \theta u_{t-1}$. It is easy to show that the lag-$k$ autocorrelations $\rho(k)$ of $x_t$ are zero,
except for $k = 1$ where $\rho(1) = -\theta/(1 + \theta^2)$, and that $\rho(1) = \rho(1, \theta)$ is bounded in
magnitude by 1/2 for $-1 < \theta < 1$. We obtain by Monte Carlo simulation a series $x_t$ of length
200 of an invertible Gaussian MA(1) process with $\theta = -0.8$ and $\sigma_u = 1$. The series
$y_t$ with AO is generated as in Example 8.3, except that now $\mu_v = 6$ instead of 4.
Both series $x_t$ and $y_t$ are displayed in Figure 8.9.

Figure 8.9 Top: Gaussian MA(1) series; bottom: Gaussian MA(1) series with additive
outliers
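
The lag-1 autocorrelation of the MA(1) process and its bound can be checked numerically. The short sketch below uses a hypothetical function name, evaluates $\rho(1) = -\theta/(1+\theta^2)$ at the value $\theta = -0.8$ used in this example, and scans a grid of invertible values of $\theta$.

```python
def ma1_rho1(theta):
    """Lag-1 autocorrelation of x_t = u_t - theta * u_{t-1}."""
    return -theta / (1.0 + theta * theta)

rho1 = ma1_rho1(-0.8)
largest = max(abs(ma1_rho1(k / 100.0)) for k in range(-99, 100))
print(round(rho1, 4))      # about 0.4878
print(largest < 0.5)       # the bound 1/2 is approached only as |theta| -> 1
```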

Figure 8.10 displays the same four ACFs and PCFs as in Figure 8.8. The ACF 
and PCF of x, and those obtained according to Procedures A and B identify a MA(1) 
model. The ACF and PCF of y, lead instead to incorrect identification. 


Figure 8.10 Estimated autocorrelations and partial autocorrelations for the MA(1)
model


8.8 Robust ARMA model estimation using
robust filters


In this section we assume that we observe $y_t = x_t + v_t$, where $x_t$ follows an ARMA
model given by equation (8.53). The parameters to be estimated are given by the
vector $\lambda = (\phi, \theta, \mu) = (\phi_1, \phi_2,\dots,\phi_p, \theta_1, \theta_2,\dots,\theta_q, \mu)$.


8.8.1 τ-estimators of ARMA models


In order to motivate the use of F$\tau$-estimators for fitting ARMA models, we first
describe naive $\tau$-estimators that do not involve the use of robust filters. Assume first
there is no contamination; that is, $y_t = x_t$. For $t \ge 2$ call $\hat y_{t|t-1}(\lambda)$ the optimal linear
predictor of $y_t$ based on $y_1,\dots,y_{t-1}$ when the true parameter is $\lambda$, as described in
Section 8.2.1. For $t = 1$ put $\hat y_{t|t-1}(\lambda) = \mu = E(y_t)$. Then if $u_t$ are normal we also have

$$\hat y_{t|t-1}(\lambda) = E(y_t \mid y_1,\dots,y_{t-1}), \quad t > 1. \tag{8.93}$$

Define the prediction errors as

$$\hat u_t(\lambda) = y_t - \hat y_{t|t-1}(\lambda). \tag{8.94}$$


Note that these errors are not the same as $\hat u_t(\lambda)$ defined by (8.58).
The variance of $\hat u_t(\lambda)$, $\sigma_t^2(\lambda) = E(y_t - \hat y_{t|t-1}(\lambda))^2$, has the form

$$\sigma_t^2(\lambda) = a_t^2(\lambda)\,\sigma_u^2, \tag{8.95}$$

with $\lim_{t\to\infty} a_t^2(\lambda) = 1$ (see Brockwell and Davis, 1991). In the AR case we have
$a_t = 1$ for $t \ge p + 1$.

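Suppose that the innovations $u_t$ have a $N(0, \sigma_u^2)$ distribution. Let $L(y_1,\dots,y_T,
\lambda, \sigma_u)$ be the likelihood, and define

$$Q(\lambda) = -2 \max_{\sigma_u} \log L(y_1,\dots,y_T, \lambda, \sigma_u). \tag{8.96}$$

Except for a constant, we have (see Brockwell and Davis, 1991)

$$Q(\lambda) = \sum_{t=1}^{T} \log a_t^2(\lambda) + T \log\!\left(\frac{1}{T}\sum_{t=1}^{T} \frac{\hat u_t^2(\lambda)}{a_t^2(\lambda)}\right). \tag{8.97}$$

Then the MLE of $\lambda$ is given by

$$\hat\lambda = \arg\min_{\lambda} Q(\lambda). \tag{8.98}$$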
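Observe that

$$\frac{1}{T}\sum_{t=1}^{T} \frac{\hat u_t^2(\lambda)}{a_t^2(\lambda)}$$

is the square of an estimator of $\sigma_u$ based on the values $\hat u_t(\lambda)/a_t(\lambda)$, $t = 1,\dots,T$. Then
it seems natural to define a $\tau$-estimator $\hat\lambda$ of $\lambda$ by minimizing

$$Q^*(\lambda) = \sum_{t=1}^{T} \log a_t^2(\lambda) + T \log\!\left(\tau^2\!\left(\frac{\hat u_1(\lambda)}{a_1(\lambda)},\dots,\frac{\hat u_T(\lambda)}{a_T(\lambda)}\right)\right) \tag{8.99}$$

where, for any $\mathbf{u} = (u_1,\dots,u_T)'$, a $\tau$-scale estimator is defined by

$$\tau^2(\mathbf{u}) = s^2(\mathbf{u})\,\frac{1}{T}\sum_{t=1}^{T} \rho_2\!\left(\frac{u_t}{s(\mathbf{u})}\right) \tag{8.100}$$

with $s(\mathbf{u})$ an M-scale estimator based on a bounded $\rho$-function $\rho_1$. See Section 5.4.3
for further details in the context of regression $\tau$-estimators.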

While regression τ-estimators can simultaneously have a high BP value of 0.5
and a high efficiency for the normal distribution, the τ-estimator $\hat{\lambda}$ of $\lambda$ has a BP of
at most $0.5/(p + 1)$ in the AR(p) case, and a BP of zero in the MA and ARMA cases, for
the reasons given at the end of Section 8.6.1.
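The τ-scale (8.100) can be illustrated numerically. The following is only a minimal sketch, not the RobStatTM implementation: it assumes bisquare ρ-functions for both $\rho_1$ and $\rho_2$, the tuning constants `c1 = 1.548` and `c2 = 6.08` and the right-hand side `delta = 0.5` of the M-scale equation are commonly quoted choices taken here as assumptions, and no consistency correction for the normal distribution is applied. All function names are hypothetical.

```python
import math
import statistics


def rho_bisquare(u, c):
    """Bounded bisquare rho-function, scaled so that rho(u) = 1 for |u| >= c."""
    if abs(u) >= c:
        return 1.0
    v = (u / c) ** 2
    return 1.0 - (1.0 - v) ** 3


def m_scale(u, c1=1.548, delta=0.5, tol=1e-10, max_iter=200):
    """M-scale s solving (1/T) * sum rho_1(u_t / s) = delta by fixed-point iteration."""
    s = statistics.median(abs(x) for x in u) / 0.6745  # normalized median of |u_t| as start
    if s == 0.0:
        return 0.0
    for _ in range(max_iter):
        mean_rho = sum(rho_bisquare(x / s, c1) for x in u) / len(u)
        s_new = s * math.sqrt(mean_rho / delta)
        if s_new == 0.0 or abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s


def tau_scale(u, c1=1.548, c2=6.08):
    """tau-scale of (8.100): tau^2(u) = s^2(u) * (1/T) * sum rho_2(u_t / s(u))."""
    s = m_scale(u, c1)
    if s == 0.0:
        return 0.0
    mean_rho2 = sum(rho_bisquare(x / s, c2) for x in u) / len(u)
    return s * math.sqrt(mean_rho2)
```

In the objective (8.99), `tau_scale` would be applied to the standardized prediction errors $\hat{u}_t(\lambda)/a_t(\lambda)$ for each candidate $\lambda$.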


8.8.2 Robust filters for ARMA models 


One way to achieve a positive (and hopefully reasonably high) BP for ARMA models 
with AO is to extend the AR robust filter method of Section 8.6.2 to ARMA models, 
by using a state-space representation of them. 

The extension consists of modifying the state-space representation (8.77)-(8.79)
as follows. Let $x_t$ be an ARMA(p, q) process and $k = \max(p, q + 1)$. In
Section 8.17 we show that it is possible to define a k-dimensional state-space
vector $\alpha_t = (\alpha_{1,t}, \ldots, \alpha_{k,t})'$ with $\alpha_{1,t} = x_t - \mu$, so that the following representation
holds:

$$\alpha_t = \Phi \alpha_{t-1} + d\,u_t, \tag{8.101}$$

where

$$d = (1, -\theta_1, \ldots, -\theta_{k-1})', \tag{8.102}$$

with $\theta_i = 0$ for $i > q$ in case $p > q$. The state-transition matrix $\Phi$ is now given by

$$\Phi = \begin{bmatrix} \boldsymbol{\phi}_{k-1} & I_{k-1} \\ \phi_k & \mathbf{0}_{k-1} \end{bmatrix},$$

where $\boldsymbol{\phi}_{k-1} = (\phi_1, \ldots, \phi_{k-1})'$ and $\phi_i = 0$ for $i > p$.

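Suppose now that the observations $y_t$ follow the AO process (8.5). The Masreliez
approximate robust filter can be derived in a way similar to the AR(p) case in
Section 8.6.2. The filter yields approximations

$$\hat{\alpha}_{t|t} = (\hat{\alpha}_{1,t|t}, \ldots, \hat{\alpha}_{k,t|t})' \quad \text{and} \quad \hat{\alpha}_{t|t-1} = (\hat{\alpha}_{1,t|t-1}, \ldots, \hat{\alpha}_{k,t|t-1})' \tag{8.103}$$

of $E(\alpha_t \mid y_1, \ldots, y_t)$ and $E(\alpha_t \mid y_1, \ldots, y_{t-1})$ respectively. Observe that

$$\hat{x}_{t|t} = \hat{x}_{t|t}(\lambda) = \hat{\alpha}_{1,t|t}(\lambda) + \mu$$

and

$$\hat{x}_{t|t-1} = \hat{x}_{t|t-1}(\lambda) = \hat{\alpha}_{1,t|t-1}(\lambda) + \mu$$

approximate $E(x_t \mid y_1, \ldots, y_t)$ and $E(x_t \mid y_1, \ldots, y_{t-1})$, respectively.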
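The recursions to obtain $\hat{\alpha}_{t|t}$ and $\hat{\alpha}_{t|t-1}$ are as follows:

$$\hat{\alpha}_{t|t-1} = \Phi \hat{\alpha}_{t-1|t-1},$$

$$\tilde{u}_t(\lambda) = y_t - \hat{x}_{t|t-1} = y_t - \hat{\alpha}_{1,t|t-1}(\lambda) - \mu, \tag{8.104}$$

and

$$\hat{\alpha}_{t|t} = \hat{\alpha}_{t|t-1} + \frac{1}{s_t}\, m_t\, \psi\left(\frac{\tilde{u}_t(\lambda)}{s_t}\right). \tag{8.105}$$

Taking the first component in the above equation, adding $\mu$ to each side, and using
the fact that the first component of $m_t$ is $s_t^2$ yields

$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + s_t\, \psi\left(\frac{\tilde{u}_t(\lambda)}{s_t}\right),$$

and therefore (8.85) and (8.86) hold.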

Further details of the recursions of $s_t$ and $m_t$ are provided in Section 8.16. The
recursions for this filter are the same as (8.80), (8.82) and the associated filter covariance
recursions in Section 8.16, with $\hat{x}_{t|t}$ replaced with $\hat{\alpha}_{t|t}$ and $\hat{x}_{t|t-1}$ replaced with
$\hat{\alpha}_{t|t-1}$. Further details are provided in Section 8.17.

As we shall see in Section 8.17, in order to implement the filter, a value for the
ARMA innovation variance $\sigma_u^2$ is needed as well as a value for $\lambda$. We deal with this
issue as in the case of an AR model by replacing this unknown variance with an
estimator $\hat{\sigma}_u^2$ in a manner described subsequently.


IOs are dealt with as described in the remark at the end of Section 8.6.2. 
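To make the recursions (8.104) and (8.105) concrete, here is a minimal sketch for the simplest case, a centered AR(1) state ($k = 1$, so $\Phi = \phi$, $d = 1$ and $m_t = s_t^2$). Several ingredients are assumptions made for illustration and are not taken from this section: the piecewise-linear redescending `psi`, the stationary start-up values, and the filtering-variance update $p_t = m_t - W(r_t)\,m_t$ with $W(r) = \psi(r)/r$, which stands in for the covariance recursions of Section 8.16. The function names are hypothetical.

```python
import math


def psi(u, a=2.0, b=3.0):
    """Hypothetical redescending psi: identity on [-a, a], linear descent to 0 at |u| = b."""
    au = abs(u)
    if au <= a:
        return u
    if au >= b:
        return 0.0
    return math.copysign(a * (b - au) / (b - a), u)


def robust_filter_ar1(y, phi, mu, sigma_u):
    """Return (filtered values x_{t|t}, residuals u_t, scales s_t) for an AR(1) with AO."""
    alpha = 0.0                               # centered state estimate alpha_{t-1|t-1}
    p = sigma_u ** 2 / (1.0 - phi ** 2)       # its variance (stationary start, an assumption)
    x_filt, resid, scales = [], [], []
    for yt in y:
        alpha_pred = phi * alpha              # alpha_{t|t-1} = Phi * alpha_{t-1|t-1}
        m = phi ** 2 * p + sigma_u ** 2       # prediction variance; here s_t^2 = m_t
        s = math.sqrt(m)
        u = yt - alpha_pred - mu              # filtered residual, as in (8.104)
        r = u / s
        alpha = alpha_pred + s * psi(r)       # (8.105) with m_t / s_t = s_t
        w = psi(r) / r if r != 0.0 else 1.0   # weight W(r) = psi(r) / r
        p = m - w * m                         # assumed filtering-variance update
        x_filt.append(alpha + mu)
        resid.append(u)
        scales.append(s)
    return x_filt, resid, scales
```

With this choice of `psi`, an observation whose standardized residual is at most 2 in absolute value is left unchanged, while one whose standardized residual exceeds 3 is replaced by its one-step prediction.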


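8.8.3 Robustly filtered τ-estimators


A τ-estimator $\hat{\lambda}$ based on the robustly filtered observations $y_t$ can now be obtained by
replacing the raw unfiltered residuals (8.94) in $Q^*(\lambda)$ of (8.99) with the new robustly
filtered residuals (8.104) and then minimizing $Q^*(\lambda)$. Then, by defining

$$Q^*(\lambda) = \sum_{t=1}^{T} \log a_t^2(\lambda) + T \log\left(\tau^2\left(\frac{\tilde{u}_1(\lambda)}{a_1(\lambda)}, \ldots, \frac{\tilde{u}_T(\lambda)}{a_T(\lambda)}\right)\right),$$

the Fτ-estimator is defined by

$$\hat{\lambda} = \arg\min_{\lambda} Q^*(\lambda). \tag{8.106}$$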


Since the above $Q^*(\lambda)$ may have several local minima, a good robust initial
estimator is required. Such an estimator is obtained by the following steps:


1. Fit an AR(p*) model using the robust Fτ-estimator of Section 8.6.3, where
p* is selected by the robust order selection criterion (RAIC) described in
Section 8.6.6. The value of p* will almost always be larger than p, and sometimes
considerably larger. This fit gives the required estimator $\hat{\sigma}_u^2$, as well as robust
parameter estimators $(\hat{\phi}_1^o, \ldots, \hat{\phi}_{p^*}^o)$ and robustly filtered values $\hat{x}_{t|t}$.

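2. Compute estimators of the first p autocorrelations of $x_t$ and of $\eta_i$, $1 \le i \le q$, where

$$\eta_i = \frac{\mathrm{Cov}(x_t, u_{t-i})}{\sigma_u^2}, \tag{8.107}$$

using the estimators $(\hat{\phi}_1^o, \ldots, \hat{\phi}_{p^*}^o)$ and $\hat{\sigma}_u^2$.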

3. Finally compute the initial parameter estimators of the ARMA(p, q) model by
matching the first p autocorrelations and the q values $\eta_i$ with those obtained in
Step 2.


Example 8.5 Estimation of a simulated MA(1) series with AO 


As an example, we generated an MA(1) series of 200 observations with 10 equally
spaced additive outliers as follows (script MA1-AO):

$$y_t = \begin{cases} x_t + 4 & \text{if } t = 20i,\ i = 1, \ldots, 10 \\ x_t & \text{otherwise} \end{cases}$$


where $x_t = 0.8u_{t-1} + u_t$ and the $u_t$ are i.i.d. N(0,1) variables.

Table 8.4 Estimates of the parameters of MA(1) simulated process

                 $\theta$    $\mu$    $\sigma_u$
Fτ               -0.80      -0.02     0.97
LS               -0.39       0.20     1.97
True values      -0.80       0.00     1.00

Figure 8.11 Simulated MA(1) series with 10 AOs: observed (solid line) and filtered
(dashed line) data (vertical axis: Series; horizontal axis: Index)


The model parameters were estimated using the Fτ- and the LS estimators, and
the results are shown in Table 8.4. We observe that the robust estimate is very close
to the true value, while the LS estimate is very much influenced by the outliers.
Figure 8.11 shows the observed series $y_t$ and the filtered series $\hat{x}_{t|t}$. It is seen that
the latter is almost coincident with $y_t$ except for the ten outliers, which are replaced
by the predicted values.
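The contaminated series of this example is easy to generate. The sketch below is not the book's script MA1-AO: it is a hypothetical stand-alone Python version that assumes a particular random seed and uses one extra innovation as the start-up value $u_0$, so the numbers it produces will differ from those in Table 8.4.

```python
import random


def simulate_ma1_with_ao(n=200, ma_coef=0.8, outlier_size=4.0, spacing=20, seed=0):
    """Return (x, y): a clean MA(1) series x_t = ma_coef * u_{t-1} + u_t and the
    series y with additive outliers of fixed size at t = spacing, 2 * spacing, ..."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]   # u[0] plays the role of u_0
    x = [ma_coef * u[t - 1] + u[t] for t in range(1, n + 1)]
    y = list(x)
    for t in range(spacing, n + 1, spacing):          # 1-based times 20, 40, ..., 200
        y[t - 1] += outlier_size
    return x, y


x, y = simulate_ma1_with_ao()
outlier_times = [t + 1 for t in range(len(y)) if abs(y[t] - x[t]) > 1e-9]
```

Here `outlier_times` lists the ten contaminated positions, which is convenient for checking any detection procedure applied to `y`.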


8.9 ARIMA and SARIMA models 


We define an autoregressive integrated moving-average process $y_t$ of orders p, d, q
(ARIMA(p, d, q) for short) as one such that its order-d differences are a stationary
ARMA(p, q) process. It therefore satisfies

$$\phi(B)(1 - B)^d y_t = \gamma + \theta(B)u_t, \tag{8.108}$$

where $\phi$ and $\theta$ are polynomials of order p and q, and $u_t$ are the innovations.
A seasonal ARIMA process $y_t$ of regular orders p, d, q, seasonal period s, and
seasonal orders P, D, Q (SARIMA(p, d, q) × (P, D, Q)$_s$ for short) fulfills the equation

$$\phi(B)\Phi(B^s)(1 - B)^d(1 - B^s)^D y_t = \gamma + \theta(B)\Theta(B^s)u_t, \tag{8.109}$$

where $\phi$ and $\theta$ are as above, and $\Phi$ and $\Theta$ are polynomials of order P and Q
respectively. It is assumed that the roots of $\phi$, $\theta$, $\Phi$ and $\Theta$ lie outside the unit circle
and then the differenced series $(1 - B)^d(1 - B^s)^D y_t$ is a stationary and invertible
ARMA process.

In what follows, we shall restrict ourselves to the case P = 0 and Q ≤ 1. Then
$\Theta(B) = 1 - \Theta_s B$ and (8.109) reduces to

$$(1 - B)^d(1 - B^s)^D \phi(B) y_t = \gamma + \theta(B)(1 - \Theta_s B^s)u_t. \tag{8.110}$$


The reason for this limitation is that, although the Fτ-estimators already defined
for ARMA models can be extended to arbitrary SARIMA models, there is a
computational difficulty in finding a suitable robust initial estimator for the iterative
optimization process. At present, this problem has been solved only for P = 0
and Q ≤ 1.

Assume now that we have observations $y_t = x_t + v_t$, where $x_t$ fulfills an ARIMA
model and $v_t$ is an outlier process. A naive way to estimate the parameters is to
difference $y_t$, thereby reducing the model to an ARMA(p, q) model, and then apply
the Fτ-estimator already described. The problem with this approach is that the
differencing operations will result in increasing the number of outliers. For example,
with an ARIMA(p, 1, q) model, the single regular difference operation will convert
isolated outliers into two consecutive outliers of opposite sign (a so-called "doublet").
However, one need not difference the data and may instead use the robust filter on
the observations $y_t$ as in the previous section, but based on the appropriate state-space
model for the process (8.110).
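The multiplication of outliers by differencing can be checked on a toy series. The sketch below is illustrative only: a unit impulse stands for one isolated additive outlier on an otherwise zero series, and the seasonal period `period = 4` and the helper name `difference` are assumptions chosen for brevity.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag); element i of the result corresponds to original index i + lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


period = 4                      # hypothetical seasonal period
y = [0.0] * 16
y[5] = 1.0                      # one isolated additive outlier

regular = difference(y)                    # (1 - B) y_t: a doublet +1, -1
both = difference(regular, lag=period)     # (1 - B)(1 - B^s) y_t: four affected values

doublet = [v for v in regular if v != 0.0]
after_seasonal = [v for v in both if v != 0.0]
```

A single regular difference yields the doublet, and adding one seasonal difference spreads the same outlier over four observations, which is why filtering the undifferenced data is preferable.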

The state-space representation is of the same form as (8.101), except that it uses
a state-transition matrix $\Phi^*$ based on the coefficients of the polynomial operator of
order $p^* = p + d + sD$

$$\phi^*(B) = (1 - B)^d(1 - B^s)^D \phi(B). \tag{8.111}$$


For example, in the case of an ARIMA(1, 1, q) model with AR polynomial operator
$\phi(B) = 1 - \phi_1 B$, we have

$$\phi^*(B) = 1 - \phi_1^* B - \phi_2^* B^2$$

with coefficients $\phi_1^* = 1 + \phi_1$ and $\phi_2^* = -\phi_1$. And for model (8.110) with p = D =
1, d = q = Q = 0 and seasonal period s = 12, we have

$$\phi^*(B) = 1 - \phi_1^* B - \phi_{12}^* B^{12} - \phi_{13}^* B^{13}$$

with

$$\phi_1^* = \phi_1, \quad \phi_{12}^* = 1, \quad \phi_{13}^* = -\phi_1.$$
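The coefficients of $\phi^*(B)$ in (8.111) follow from plain polynomial multiplication, which the following sketch carries out. The function names `poly_mul` and `phi_star` are hypothetical, and polynomials in B are stored as coefficient lists in increasing powers.

```python
def poly_mul(a, b):
    """Multiply two polynomials in B given as coefficient lists (increasing powers)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def phi_star(phi, d=0, D=0, s=1):
    """Coefficients (phi*_1, ..., phi*_{p*}) of phi*(B) = (1 - B)^d (1 - B^s)^D phi(B),
    written as phi*(B) = 1 - phi*_1 B - ... - phi*_{p*} B^{p*}."""
    poly = [1.0] + [-c for c in phi]                 # phi(B) = 1 - phi_1 B - ...
    for _ in range(d):
        poly = poly_mul(poly, [1.0, -1.0])           # regular difference (1 - B)
    seasonal = [1.0] + [0.0] * (s - 1) + [-1.0]      # seasonal difference (1 - B^s)
    for _ in range(D):
        poly = poly_mul(poly, seasonal)
    return [-c for c in poly[1:]]


arima_111 = phi_star([0.5], d=1)            # phi*_1 = 1 + phi_1, phi*_2 = -phi_1
seasonal_12 = phi_star([0.5], D=1, s=12)    # nonzero at lags 1, 12 and 13 only
```

With $\phi_1 = 0.5$ the two calls reproduce the two examples above: coefficients 1.5 and -0.5 in the first case, and 0.5, 1 and -0.5 at lags 1, 12 and 13 in the second.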


Therefore, for each value of $\lambda = (\phi, \theta, \gamma, \Theta_s)$ (where $\Theta_s$ is the seasonal MA
parameter when Q = 1), the filtered residuals corresponding to the operators $\phi^*$
and $\theta^*$ are computed, yielding the residuals $\tilde{u}_t(\lambda)$. Then the Fτ-estimator is defined
by the $\lambda$ minimizing $Q^*(\lambda)$, with $Q^*$ defined as in (8.99) but with $\phi^*(B)$ instead
of $\phi(B)$.

More details can be found in Bianco et al. (1996). 


Example 8.6 Residential telephone extensions (RESEX) series. Tables and figures 
for this example are obtained with script RESEX.R. 


This example deals with a monthly series of inward movement of residential
telephone extensions in a fixed geographic area from January 1966 to May
1973 (RESEX). The series was analyzed by Brubacher (1974), who identified a
SARIMA(2,0,0) × (0,1,0)$_{12}$ model, and by Martin et al. (1983).

Table 8.5 displays the LS and Fτ-estimators of the parameters (script RESEX).
We observe that they are quite different, and the estimate of the SD of the innovations
corresponding to the LS estimator is much larger than the ones obtained with the
Fτ-estimators.

Figure 8.12 shows the observed data $y_t$ and the filtered values $\hat{x}_{t|t}$, which are seen
to be almost coincident with $y_t$ except at outlier locations.

In Figure 8.13 we show the quantiles of the absolute values of the residuals of the
three estimators. The three largest residuals are huge and hence were not included,
to improve the readability of the graph. It is seen that the Fτ-estimator yields the smallest
quantiles, and hence gives the best fit to the data.


Table 8.5 Estimates of the parameters of RESEX series

Estimators    $\phi_1$    $\phi_2$    $\gamma$    $\sigma_u$
Fτ             0.27        0.49       0.41       1.12
LS             0.48       -0.17       1.86       6.45

Figure 8.12 RESEX series: observed (solid line) and filtered (circles) values

Figure 8.13 Quantiles of absolute residuals of estimators for RESEX series
(vertical axis: Quantiles; horizontal axis: Probability)


8.10 Detecting time series outliers and level shifts


In many situations it is important to identify the type of perturbations that the series 
undergo. In this section we describe classical and robust diagnostic methods to 
detect outliers and level shifts in ARIMA models. As for the diagnostic procedures 
described in Chapter 4 for regression, the classical procedures are based on residuals 
obtained using nonrobust estimators. In general, these procedures succeed only when 
the proportion of outliers is very low and the outliers are not very large. Otherwise, 
due to masking effects, the outliers may not be detected. 
Let $y_t$, $1 \le t \le T$, be an observed time series. We consider perturbed models of
the form

$$y_t = x_t + \omega\,\xi_t^{(t_0)}, \tag{8.112}$$

where the unobservable series $x_t$ is an ARIMA process satisfying

$$\phi(B)(1 - B)^d x_t = \theta(B)u_t, \tag{8.113}$$

and the term $\omega\,\xi_t^{(t_0)}$ represents the effect on period t of the perturbation occurring at
time $t_0$.
The value of $\omega$ in (8.112) measures the size of the perturbation at time $t_0$ and the
form of $\xi_t^{(t_0)}$ depends on the type of the outlier. Let $o_t^{(t_0)}$ be an indicator variable for
time $t_0$ ($o_t^{(t_0)} = 1$ for $t = t_0$ and 0 otherwise). Then an AO at time $t_0$ can be modeled by

$$\xi_t^{(t_0)} = o_t^{(t_0)} \tag{8.114}$$

and a level shift at time $t_0$ by

$$\xi_t^{(t_0)} = \begin{cases} 0 & \text{if } t < t_0 \\ 1 & \text{if } t \ge t_0. \end{cases}$$

To model an IO at time $t_0$, the observed series $y_t$ is given by

$$\phi(B)(1 - B)^d y_t = \theta(B)(u_t + \omega\,o_t^{(t_0)}).$$

Then, for an IO we get

$$\xi_t^{(t_0)} = \phi(B)^{-1}(1 - B)^{-d}\theta(B)\,o_t^{(t_0)}. \tag{8.115}$$


We know that robust estimators are not very much influenced by a small fraction 
of atypical observations in the cases of IO or AO. The case of level shifts is different. 
A level shift at period $t_0$ modifies all the observations $y_t$ with $t \ge t_0$. However, if the
model includes a first-order difference, then differencing (8.112) gives

$$(1 - B)y_t = (1 - B)x_t + \omega(1 - B)\xi_t^{(t_0)},$$

and since $(1 - B)\xi_t^{(t_0)} = o_t^{(t_0)}$ the differenced series has an AO at time $t_0$. Then a robust
estimator applied to the differenced series is not going to be influenced by a level shift.
Therefore, the only case in which a robust procedure may be influenced by a level
shift is when the model does not contain any difference. Note however that a second
order difference converts a level shift to a doublet outlier.
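The effect of differencing on a level shift can be verified directly on the regressor $\omega\,\xi_t^{(t_0)}$. The following small sketch uses hypothetical values (T = 10, $t_0 = 6$, $\omega = 3$) and a helper `diff1` that is not part of any package.

```python
def diff1(series):
    """Apply the first-difference operator (1 - B) to a list."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


T, t0, omega = 10, 6, 3.0                                        # t0 is a 1-based time index
level_shift = [omega if t >= t0 else 0.0 for t in range(1, T + 1)]

first = diff1(level_shift)    # single spike of size omega at time t0: an AO
second = diff1(first)         # +omega at t0 and -omega at t0 + 1: a doublet
```

The list `first` starts at time 2 and `second` at time 3, so the spike in `first` sits at list position 4 and the doublet in `second` at positions 3 and 4.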
8.10.1 Classical detection of time series outliers and level shifts


In this subsection, we shall describe the basic ideas of Chang et al. (1988) for outlier
detection in ARIMA models. Similar approaches were considered by Tsay (1988)
and Chen and Liu (1993). Procedures based on deletion diagnostics were proposed
by Peña (1987, 1990), Abraham and Chuang (1989), Bruce and Martin (1989) and
Ledolter (1991).

For the sake of simplicity, we start by assuming that the parameters of the ARIMA
model, $\lambda$ and $\sigma_u^2$, are known.

Let $\pi(B)$ be the filter defined by

$$\pi(B) = \theta(B)^{-1}\phi(B)(1 - B)^d = 1 - \pi_1 B - \pi_2 B^2 - \cdots - \pi_k B^k - \cdots. \tag{8.116}$$

Then, from (8.113), $\pi(B)x_t = u_t$. Since $\pi(B)$ is a linear operator, we can apply it to
both sides of (8.112), obtaining

$$\pi(B)y_t = u_t + \omega\,\pi(B)\xi_t^{(t_0)}, \tag{8.117}$$

which is a simple linear regression model with independent errors and regression
coefficient $\omega$.
Therefore, the LS estimator of $\omega$ is given by

$$\hat{\omega} = \frac{\sum_{t=t_0}^{T} (\pi(B)y_t)(\pi(B)\xi_t^{(t_0)})}{\sum_{t=t_0}^{T} (\pi(B)\xi_t^{(t_0)})^2}, \tag{8.118}$$

with variance

$$\mathrm{Var}(\hat{\omega}) = \frac{\sigma_u^2}{\sum_{t=t_0}^{T} (\pi(B)\xi_t^{(t_0)})^2}, \tag{8.119}$$

where $\sigma_u^2$ is the variance of $u_t$.
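As an illustration of (8.118) and (8.119), consider an AR(1) model with known coefficient, for which $\pi(B) = 1 - \phi B$. The sketch below is a simplified example under that assumption, not the procedure of Chang et al. (1988) in full generality: the function names are hypothetical, time indices are 0-based, and the series is taken to be zero before the first observation.

```python
def apply_ar1_filter(series, phi):
    """Compute pi(B) z_t = z_t - phi * z_{t-1}, taking z = 0 before the sample starts."""
    out, prev = [], 0.0
    for z in series:
        out.append(z - phi * prev)
        prev = z
    return out


def ls_outlier_size(y, xi, phi, sigma_u2, t0):
    """LS estimate (8.118), its variance (8.119) and the ratio |omega_hat| / sd
    for a perturbation with regressor xi at 0-based time t0."""
    e = apply_ar1_filter(y, phi)       # pi(B) y_t
    g = apply_ar1_filter(xi, phi)      # pi(B) xi_t
    num = sum(e[t] * g[t] for t in range(t0, len(y)))
    den = sum(g[t] ** 2 for t in range(t0, len(y)))
    omega_hat = num / den
    var_omega = sigma_u2 / den
    return omega_hat, var_omega, abs(omega_hat) / var_omega ** 0.5


T, t0 = 10, 4
ao_regressor = [1.0 if t == t0 else 0.0 for t in range(T)]      # additive outlier
ls_regressor = [1.0 if t >= t0 else 0.0 for t in range(T)]      # level shift
y_ao = [5.0 * v for v in ao_regressor]                          # noise-free toy data
y_ls = [5.0 * v for v in ls_regressor]
```

For the noise-free toy series, `ls_outlier_size(y_ao, ao_regressor, 0.5, 1.0, t0)` and `ls_outlier_size(y_ls, ls_regressor, 0.5, 1.0, t0)` both recover the perturbation size 5, with denominators $1 + \phi^2$ and $1 + (T - t_0 - 1)(1 - \phi)^2$ respectively.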

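In practice, since the parameters of the ARIMA model are unknown, (8.118) and
(8.119) are computed using LS or ML estimators of the ARIMA parameters. Let $\hat{\pi}$
be defined as in (8.116) but using the estimators instead of the true parameters. Then
(8.118) and (8.119) are replaced by

$$\hat{\omega} = \frac{\sum_{t=t_0}^{T} \hat{u}_t\,(\hat{\pi}(B)\xi_t^{(t_0)})}{\sum_{t=t_0}^{T} (\hat{\pi}(B)\xi_t^{(t_0)})^2} \tag{8.120}$$

and

$$\widehat{\mathrm{Var}}(\hat{\omega}) = \frac{\hat{\sigma}_u^2}{\sum_{t=t_0}^{T} (\hat{\pi}(B)\xi_t^{(t_0)})^2}, \tag{8.121}$$

where

$$\hat{u}_t = \hat{\pi}(B)y_t$$

and

$$\hat{\sigma}_u^2 = \frac{1}{T - t_0} \sum_{t=t_0}^{T} \left(\hat{u}_t - \hat{\omega}\,\hat{\pi}(B)\xi_t^{(t_0)}\right)^2.$$

In the case of IO, the estimator of the outlier size given by (8.118) reduces to the
innovation residual at $t_0$; that is, $\hat{\omega} = \hat{u}_{t_0}$.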

A test to detect the presence of an outlier at a given $t_0$ can be based on the t-like
statistic

$$U = \frac{|\hat{\omega}|}{(\widehat{\mathrm{Var}}(\hat{\omega}))^{1/2}}. \tag{8.122}$$

Since, in general, neither $t_0$ nor the type of outlier are known, in order to decide
if there is an outlier at any position, the statistic

$$U_0 = \max_{t_0} \max\{U_{t_0,\mathrm{AO}},\ U_{t_0,\mathrm{LvS}},\ U_{t_0,\mathrm{IO}}\}$$

is used, where $U_{t_0,\mathrm{AO}}$, $U_{t_0,\mathrm{LvS}}$ and $U_{t_0,\mathrm{IO}}$ are the statistics defined by (8.122)
corresponding to an AO, level shift (LvS) and IO at time $t_0$, respectively. If $U_0 > M$,
where M is a conveniently chosen constant, one declares that there is an outlier or
level shift. The time $t_0$ at which the outlier or level shift occurs and whether the
additive effect is an AO, IO or LvS is determined by where the double maximum
is attained.

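Since the values $\hat{u}_t = \hat{\pi}(B)y_t$
can only be computed from a series extending into the infinite past, in practice, with
data observed for $t = 1, \ldots, T$, they are approximated by

$$\hat{u}_t = y_t - \hat{\pi}_1 y_{t-1} - \cdots - \hat{\pi}_{t-1} y_1.$$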

As mentioned above, this type of procedure may fail in the presence of a large
fraction of outliers and/or level shifts. This failure may be due to two facts. On one
hand, the outliers or level shift may have a large influence on the MLE, and therefore
the residuals may not reveal the outlying observations. This drawback may be
overcome by using robust estimators of the ARMA coefficients. On the other hand, if $y_{t_0}$
is an outlier or level shift, as noted before, not only is $\hat{u}_{t_0}$ affected, but the effect of
the outlier or level shift is propagated to the subsequent innovation residuals. Since
the statistic $U_{t_0}$ is designed to detect the presence of an outlier or level shift at time $t_0$,
it is desirable that $U_{t_0}$ be influenced by only an outlier at $t_0$. Outliers or level shifts at
previous locations, however, may have a misleading influence on $U_{t_0}$. In the next
subsection we show how to overcome this problem by replacing the innovation residuals
$\hat{u}_t$ by the filtered residuals studied in Section 8.8.3.


8.10.2 Robust detection of outliers and level shifts for ARIMA models


In this section we describe an iterative procedure introduced by Bianco (2001) for the
detection of AO, level shifts and IO in an ARIMA model. The algorithm is similar
to the one described in the previous subsection. The main difference is that the new
method uses innovation residuals based on the Fτ-estimators of the ARIMA
parameters instead of a Gaussian MLE, and uses a robust filter instead of the filter $\pi$ to
obtain an equation analogous to (8.117).

A detailed description of the procedure follows: 


Step 1 Estimate the parameters $\lambda$ and $\sigma_u$ robustly using an Fτ-estimator. These
estimators will be denoted by $\hat{\lambda}$ and $\hat{\sigma}_u$, respectively.

Step 2 Apply the robust filter described in Section 8.8.3 to $y_t$ using the estimators
computed in Step 1. This step yields the filtered residuals $\tilde{u}_t$ and the scales $s_t$.

Step 3 In order to make the procedure less costly in terms of computing time, a
preliminary set of outlier locations is determined in the following way:

• Declare that time $t_0$ is a candidate for an outlier or level shift location if

$$|\tilde{u}_{t_0}| > M_1 s_{t_0}, \tag{8.123}$$

where $M_1$ is a conveniently chosen constant.
• Denote by C the set of $t_0$'s where (8.123) holds.

Step 4 For each $t_0 \in C$, let $\pi^*$ be a robust filter similar to the one applied in Step 2,
but such that for $t_0 \le t \le t_0 + h$ the function $\psi$ is replaced by the identity for a
conveniently chosen value of h. Call $\tilde{u}_t^* = \pi^*(B)y_t$ the residuals obtained with
this filter. Since these residuals now have different variances, we estimate $\omega$
by weighted LS, with weights proportional to $1/s_t^2$. Then (8.120), (8.121) and
(8.122) are now replaced by

$$\hat{\omega} = \frac{\sum_{t=t_0}^{T} \tilde{u}_t^*\,\hat{\pi}(B)\xi_t^{(t_0)}/s_t^2}{\sum_{t=t_0}^{T} (\hat{\pi}(B)\xi_t^{(t_0)})^2/s_t^2}, \tag{8.124}$$

$$\widehat{\mathrm{Var}}(\hat{\omega}) = \frac{1}{\sum_{t=t_0}^{T} (\hat{\pi}(B)\xi_t^{(t_0)})^2/s_t^2} \tag{8.125}$$

and

$$U^* = \frac{|\hat{\omega}|}{(\widehat{\mathrm{Var}}(\hat{\omega}))^{1/2}}. \tag{8.126}$$

The purpose of replacing $\hat{\pi}$ by $\pi^*$ is to eliminate the effects of outliers at
positions different from $t_0$. For this reason, the effect of those outliers before
$t_0$ and after $t_0 + h$ is reduced by means of the robust filter. Since we want
to detect a possible outlier at time $t_0$, and the effect of an outlier propagates
to the subsequent observations, we do not downweight the effect of possible
outliers between $t_0$ and $t_0 + h$.

Step 5 Compute

$$U_0^* = \max_{t_0 \in C} \max\{U^*_{t_0,\mathrm{AO}},\ U^*_{t_0,\mathrm{LvS}},\ U^*_{t_0,\mathrm{IO}}\},$$

where $U^*_{t_0,\mathrm{AO}}$, $U^*_{t_0,\mathrm{LvS}}$ and $U^*_{t_0,\mathrm{IO}}$ are the statistics defined by (8.126)
corresponding to an AO, level shift and IO at time $t_0$, respectively. If
$U_0^* \le M_2$, where $M_2$ is a conveniently chosen constant, no new outliers
are detected and the iterative procedure is stopped. Instead, if $U_0^* > M_2$, a
new AO, level shift or IO is detected, depending on where the maximum
is attained.

Step 6 Clean the series from the detected AO, level shifts or IO by replacing $y_t$ by
$y_t - \hat{\omega}\,\xi_t^{(t_0)}$, where $\xi_t^{(t_0)}$ corresponds to the perturbation at $t_0$, pointed out by
the test. Then the procedure is iterated, going back to Step 2 until no new
perturbations are found.


The constant $M_1$ should be chosen rather small (for example $M_1 = 2$) to increase
the power of the procedure for the detection of outliers. Based on simulations we
recommend using $M_2 = 3$.

As already mentioned, this procedure will be reliable to detect level shifts only if 
the ARIMA model includes at least an ordinary difference (d > 0). 


Example 8.7. Continuation of Example 8.5. Tables and figures for this example are 
obtained with script identMA1.R. 


On applying the robust procedure just described to the data, all the outliers were 
detected. Table 8.6 shows the outliers found with this procedure as well as their cor- 
responding type and size and the value of the test statistic. 


Table 8.6 Outliers detected with the robust procedure in simulated MA(1) series

Index    Type    Size    $U_0^*$
 20      AO      4.72    7.47
 40      AO      4.46    7.51
 60      AO      4.76    8.05
 80      AO      3.82    6.98
100      AO      4.02    6.39
120      AO      4.08    7.04
140      AO      4.41    7.48
160      AO      4.74    7.95
180      AO      4.39    7.53
200      AO      5.85    6.59

Table 8.7 Detected outliers in the RESEX series

Index    Date     Type    Size     U
29       5/68     AO       2.52    3.33
47       11/69    AO      -1.80    3.16
65       5/71     LvS      1.95    3.43
77       5/72     AO       4.78    5.64
83       11/72    AO      52.27   55.79
84       12/72    AO      27.63   27.16
89       5/73     AO       4.95    3.12


The classical procedure of Section 8.10.1 detects only two outliers: observations
120 and 160. The LS estimators of the parameters after removing the effect of these
two outliers are $\hat{\theta} = -0.48$ and $\hat{\mu} = 0.36$, which are also far from the true values.


Example 8.8 Continuation of Example 8.6. Tables and figures for this example are 
obtained with script RESEX.R. 


Table 8.7 shows the outliers and level shifts found by applying the robust procedure 
to the RESEX data. We observe two very large outliers corresponding to the last two 
months of 1972. The explanation is that November 1972 was a “bargain” month; 
that is, free installation of residential extensions, with a spillover effect, since not all 
November orders could be fulfilled that month. 


8.10.3 REGARIMA models: estimation and outlier detection 


A REGARIMA model is a regression model where the errors are an ARIMA time 
series. Suppose that we have T observations $(x_1, y_1), \ldots, (x_T, y_T)$, with $x_t \in R^k$,
$y_t \in R$ satisfying

$$y_t = \beta' x_t + e_t,$$

where $e_1, \ldots, e_T$ follow an ARIMA(p, d, q) model

$$\phi(B)(1 - B)^d e_t = \theta(B)u_t.$$


As in the preceding cases, we consider the situation when the actual observations
are described by a REGARIMA model plus AO, IO and level shifts. In other words,
instead of observing $y_t$ we observe

$$y_t^* = y_t + \omega\,\xi_t^{(t_0)}, \tag{8.127}$$

where $\xi_t^{(t_0)}$ is as in the ARIMA model.

All the procedures for ARIMA models described in the preceding sections can
be extended to REGARIMA models.
Define for each value of $\beta$,

$$\hat{e}_t(\beta) = y_t^* - \beta' x_t, \quad t = 1, \ldots, T,$$

and put

$$\hat{w}_t(\beta) = (1 - B)^d \hat{e}_t(\beta), \quad t = d + 1, \ldots, T.$$

When $\beta$ is the true parameter, $\hat{w}_t(\beta)$ follows an ARMA(p, q) model with an AO, IO or
level shift. Then it is natural to define for any $\beta$ and $\lambda = (\phi, \theta)$ the residuals $\hat{u}_t(\beta, \lambda)$
as in (8.58), but replacing $y_t$ by $\hat{w}_t(\beta)$; that is,

$$\hat{u}_t(\beta, \lambda) = \hat{w}_t(\beta) - \phi_1 \hat{w}_{t-1}(\beta) - \cdots - \phi_p \hat{w}_{t-p}(\beta) + \theta_1 \hat{u}_{t-1}(\beta, \lambda) + \cdots + \theta_q \hat{u}_{t-q}(\beta, \lambda) \quad (t = p + d + 1, \ldots, T).$$

Then the LS estimator is defined as $(\hat{\beta}, \hat{\lambda})$ minimizing

$$\sum_{t=p+d+1}^{T} \hat{u}_t^2(\beta, \lambda),$$

and an M-estimator is defined as $(\hat{\beta}, \hat{\lambda})$ minimizing

$$\sum_{t=p+d+1}^{T} \rho\left(\frac{\hat{u}_t(\beta, \lambda)}{\hat{\sigma}}\right),$$

where $\hat{\sigma}$ is the scale estimator of the innovations $u_t$. As in the case of regression with
independent errors, the LS estimator is very sensitive to outliers, and M-estimators
with bounded $\rho$ are robust when the $u_t$'s are heavy tailed, but not for other types of
outliers like AOs.
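The unfiltered residuals $\hat{u}_t(\beta, \lambda)$ defined above can be computed by a short recursion. The following sketch is illustrative only: the function name is hypothetical, indices are 0-based, and the start-up residuals before the first computable time are set to zero, which is a simplifying assumption.

```python
def regarima_residuals(y, X, beta, phi, theta, d=0):
    """Residuals u_t(beta, lambda) for a regression with ARIMA(p, d, q) errors.

    y: responses; X: list of regressor vectors; beta: regression coefficients;
    phi, theta: lists of AR and MA coefficients; d: number of regular differences."""
    # Regression residuals e_t(beta) = y_t - beta' x_t
    e = [yt - sum(b * x for b, x in zip(beta, xt)) for yt, xt in zip(y, X)]
    # Differenced series w_t(beta) = (1 - B)^d e_t(beta)
    w = e
    for _ in range(d):
        w = [w[t] - w[t - 1] for t in range(1, len(w))]
    p, q = len(phi), len(theta)
    u = [0.0] * len(w)                      # start-up residuals set to zero (assumption)
    for t in range(p, len(w)):
        ar = sum(phi[i] * w[t - 1 - i] for i in range(p))
        ma = sum(theta[j] * u[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        u[t] = w[t] - ar + ma
    return u[p:]
```

The LS and M-estimators above then minimize the sum of squares, or the sum of $\rho(\cdot/\hat{\sigma})$, of the values returned for each candidate $(\beta, \lambda)$.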

Let $\tilde{u}_t(\beta, \lambda)$ be the filtered residuals corresponding to the series $\hat{e}_t(\beta)$ using the
ARIMA(p, d, q) model with parameter $\lambda$. Then we can define Fτ-estimators as in
Section 8.8.3; that is,

$$(\hat{\beta}, \hat{\lambda}) = \arg\min_{\beta, \lambda} Q^*(\beta, \lambda),$$

where

$$Q^*(\beta, \lambda) = \sum_{t=1}^{T} \log a_t^2(\lambda) + T \log\left(\tau^2\left(\frac{\tilde{u}_1(\beta, \lambda)}{a_1(\lambda)}, \ldots, \frac{\tilde{u}_T(\beta, \lambda)}{a_T(\lambda)}\right)\right).$$

The robust procedure for detecting the outliers and level shifts of Section 8.10.2
can also easily be extended to REGARIMA models. For details on the Fτ-estimators
and outliers and level shift detection procedures for REGARIMA models, see Bianco
et al. (2001).


8.11 Robustness measures for time series


8.11.1 Influence function 


In all situations considered so far, we have a finite-dimensional vector $\lambda$ of unknown
parameters (say, $\lambda = (\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \mu)'$ for ARMA models) and an
estimator $\hat{\lambda}_T = \hat{\lambda}_T(y_1, \ldots, y_T)$. When $y_t$ is a strictly stationary process, it holds under very
general conditions that $\hat{\lambda}_T$ converges in probability to a vector $\hat{\lambda}_\infty$, which depends on
the joint (infinite-dimensional) distribution F of $\{y_t : t = 1, 2, \ldots\}$.

Künsch (1984) extends Hampel's definition (3.4) of the influence function to time
series in the case that $\hat{\lambda}_T$ is defined by M-estimating equations that depend on a fixed
number k of observations:

$$\sum_{t=k}^{T} \Psi(\mathbf{y}_t, \hat{\lambda}_T) = 0, \tag{8.128}$$

where $\mathbf{y}_t = (y_t, \ldots, y_{t-k+1})'$. Strict stationarity implies that the distribution $F_k$ of $\mathbf{y}_t$
does not depend on t. Then, for a general class of stationary processes, $\hat{\lambda}_\infty$ exists and
depends only on $F_k$, and is the solution of the equation

$$E_{F_k} \Psi(\mathbf{y}_t, \lambda) = 0. \tag{8.129}$$


For this type of time series, the Hampel influence function could be defined as

$$\mathrm{IF}_H(\mathbf{y}; \hat{\lambda}, F_k) = \lim_{\varepsilon \downarrow 0} \frac{\hat{\lambda}_\infty[(1 - \varepsilon)F_k + \varepsilon\delta_{\mathbf{y}}] - \hat{\lambda}_\infty(F_k)}{\varepsilon}, \tag{8.130}$$

where $\mathbf{y} = (y_k, \ldots, y_1)'$, and the subscript H stands for the Hampel definition. Then,
proceeding as in Section 5.11.1, it can be shown that for estimators of the form (8.128)
the analog of (3.48) holds. Then, by analogy with (3.29), the gross-error sensitivity
is defined as $\sup_{\mathbf{y}} \|\mathrm{IF}_H(\mathbf{y}; \hat{\lambda}, F_k)\|$, where $\|\cdot\|$ is a convenient norm.

If $y_t$ is an AR(p) process, it is natural to generalize LS through M-estimators of
the form (8.129) with k = p + 1. Künsch (1984) found the Hampel-optimal estimator
for this situation, which turns out to be a GM-estimator of Schweppe form (5.86).

However, this definition has several drawbacks:


• This form of contamination is not the one we would like to represent. The
intuitive idea of a contamination rate $\varepsilon = 0.05$ is that about 5% of the observations
are altered. But in the definition (8.130), $\varepsilon$ is the proportion of outliers in each
k-dimensional marginal. In general, given $\varepsilon$ and $\mathbf{y}$, there exists no process such
that all its k-dimensional marginals are $(1 - \varepsilon)F_k + \varepsilon\delta_{\mathbf{y}}$.

• The definition cannot be applied to processes such as ARMA, in which the natural
estimating equations do not depend on finite-dimensional distributions.


An alternative approach was taken by Martin and Yohai (1986) who introduced a 
new definition of influence functional for time series, which we now briefly discuss. 




We assume that observations $y_t$ are generated by the general replacement
outliers model

$$y_t^\varepsilon = (1 - z_t^\varepsilon)x_t + z_t^\varepsilon w_t, \tag{8.131}$$

where $x_t$ is a stationary process (typically normally distributed) with joint
distribution $F_x$, $w_t$ is an outlier-generating process and $z_t^\varepsilon$ is a 0-1 process with
$P(z_t^\varepsilon = 1) = \varepsilon$. This model encompasses the AO model through the choice $w_t = x_t + v_t$ with $v_t$
independent of $x_t$, and provides a pure replacement model when $w_t$ is independent
of $x_t$. The model can generate both isolated and patch outliers of various lengths
through appropriate choices of the process $z_t^\varepsilon$. Assume that $\hat{\lambda}_\infty(F_y^\varepsilon)$ is well defined
for the distribution $F_y^\varepsilon$ of $y_t^\varepsilon$. Then the time series influence function $\mathrm{IF}(\{F_{x,z,w}^\varepsilon\}; \hat{\lambda})$ is
the directional derivative at $F_x$:

$$\mathrm{IF}(\{F_{x,z,w}^\varepsilon\}; \hat{\lambda}) = \lim_{\varepsilon \downarrow 0} \frac{\hat{\lambda}_\infty(F_y^\varepsilon) - \hat{\lambda}_\infty(F_x)}{\varepsilon}, \tag{8.132}$$

where $F_{x,z,w}^\varepsilon$ is the joint distribution of the processes $x_t$, $z_t^\varepsilon$ and $w_t$.

The first argument of IF is a distribution, and so in general the time series IF is
a functional on a distribution space, which is to be contrasted with $\mathrm{IF}_H$, which is a
function on a finite-dimensional space. However, in practice we often choose special
forms of the outlier-generating process $w_t$ such as constant amplitude outliers; for
example, for AOs we may let $w_t = x_t + v$ and for pure ROs we let $w_t = v$, where v
is a constant.

Although the time series IF is similar in spirit to $\mathrm{IF}_H$, it coincides with the
latter only in the very restricted case that $\hat{\lambda}_T$ is permutation invariant and (8.131) is
restricted to an i.i.d. pure RO model (see Corollary 4.1 in Martin and Yohai, 1986).

While IF is generally different from $\mathrm{IF}_H$, there is a close relationship between
both: if $\hat{\lambda}_T$ is defined by (8.128), then under regularity conditions:

$$\mathrm{IF}(\{F_{x,z,w}^\varepsilon\}; \hat{\lambda}) = \lim_{\varepsilon \downarrow 0} \frac{E\,\mathrm{IF}_H(\mathbf{y}_k^\varepsilon; \hat{\lambda}, F_{x,k})}{\varepsilon}, \tag{8.133}$$

where $F_{x,k}$ is the k-dimensional marginal of $F_x$, and the distribution of $\mathbf{y}_k^\varepsilon$ is the
k-dimensional marginal of $F_y^\varepsilon$.
The above result is proved in Theorem 4.1 of Martin and Yohai (1986), where a
number of other results concerning the time series IF are presented. In particular:


• Conditions are established which aid in the computation of time series IFs.

• IFs are computed for LS and robust estimators of AR(1) and MA(1) models, and
the results reveal the differing behaviors of the estimators for both isolated and
patchy outliers.

• It is shown that for MA models, bounded ψ-functions do not yield bounded IFs,
whereas redescending ψ-functions do yield bounded IFs.

• Optimality properties are established for a class of estimators known as generalized
RA estimators.


8.11.2 Maximum bias


In Chapter 3 we defined the maximum asymptotic bias of an estimator $\hat{\theta}$ at a
distribution F in an ε-contamination neighborhood of a parametric model. This definition
made sense for i.i.d. observations, but cannot be extended in a straightforward manner
to time series. A basic difficulty is that the simple mixture model $(1 - \varepsilon)F_\theta + \varepsilon G$ that
suffices for independent observations is not adequate for time series for the reasons
given in the previous section.

As a simple case, consider estimation of $\phi$ in the AR(1) model $x_t = \phi x_{t-1} + u_t$,
where $u_t$ has $N(0, \sigma_u^2)$ distribution. The asymptotic value of the LS estimator and
of the M- and GM-estimators depends on the joint distribution $F_{2,y}$ of $y_1$ and $y_2$.
Specification of $F_{2,y}$ is more involved than the two-term mixture distribution
$(1 - \varepsilon)F_\theta + \varepsilon G$ used in the definition of bias given in Section 3.3. For example, suppose we
have the AO model given by (8.5), where $v_t$ is an i.i.d. series independent of $x_t$ with
contaminated normal distribution (8.6). Denote by $N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \gamma)$ the bivariate
normal distribution with means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$ and covariance $\gamma$, and
call $\sigma_x^2 = \mathrm{Var}(x_t) = \sigma_u^2/(1 - \phi^2)$. Then the joint distribution $F_{2,y}$ is a normal mixture
distribution with four components:

$$\begin{aligned}
&(1 - \varepsilon)^2 N_2(0, 0, 1, 1, \phi) + \varepsilon(1 - \varepsilon) N_2(0, 0, 1 + \sigma_v^2, 1, \phi) \\
&\quad + \varepsilon(1 - \varepsilon) N_2(0, 0, 1, 1 + \sigma_v^2, \phi) + \varepsilon^2 N_2(0, 0, 1 + \sigma_v^2, 1 + \sigma_v^2, \phi).
\end{aligned} \tag{8.134}$$

The four terms correspond to the cases of no outliers, an outlier in $y_1$, an outlier
in $y_2$, and outliers in both $y_1$ and $y_2$, respectively.
This distribution is even more complicated when modeling patch outliers in $v_t$,
and things get much more challenging for estimators that depend on joint distributions
of order greater than two, such as AR(p), MA(q) and ARMA(p, q) models, where
one must consider either p-dimensional joint distributions or joint distributions of all
orders.

Martin and Jong (1977) took the above joint distribution modeling approach in 
computing maximum bias curves for a particular class of GM-estimators of an AR(1) 
parameter under both isolated and patch AO models. But it seems difficult to extend 
such calculations to higher-order models, and typically one has to resort to simulation 
methods to estimate maximum bias and BP (see Section 8.11.4 for an example of 
simulation computation of maximum bias curves). 

A simple example of bias computation was given in Section 8.1.3. The
asymptotic value of the LS estimator is the correlation between $y_1$ and $y_2$, and as such
can be computed from the mixture expression (8.134), as the reader may verify
(Problem 8.12).
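A numerical check of this computation can be sketched as follows. The code assumes the mixture exactly as displayed in (8.134), that is, with $\mathrm{Var}(x_t) = 1$ and zero-mean outliers, so the mixture variances are simply weighted averages of the component variances; the function names are hypothetical.

```python
def ao_mixture_components(eps, sigma_v2, phi):
    """(weight, var(y1), var(y2), cov(y1, y2)) for the four components of (8.134)."""
    return [
        ((1.0 - eps) ** 2, 1.0, 1.0, phi),                       # no outliers
        (eps * (1.0 - eps), 1.0 + sigma_v2, 1.0, phi),           # outlier in y1
        (eps * (1.0 - eps), 1.0, 1.0 + sigma_v2, phi),           # outlier in y2
        (eps ** 2, 1.0 + sigma_v2, 1.0 + sigma_v2, phi),         # outliers in both
    ]


def ls_asymptotic_value(eps, sigma_v2, phi):
    """Correlation between y1 and y2 under the zero-mean mixture (8.134)."""
    comps = ao_mixture_components(eps, sigma_v2, phi)
    var1 = sum(w * v1 for w, v1, _, _ in comps)
    var2 = sum(w * v2 for w, _, v2, _ in comps)
    cov = sum(w * c for w, _, _, c in comps)
    return cov / (var1 * var2) ** 0.5
```

Under these assumptions the result reduces to $\phi/(1 + \varepsilon\sigma_v^2)$; for instance $\varepsilon = 0.1$, $\sigma_v^2 = 9$ and $\phi = 0.5$ give $0.5/1.9$, showing the shrinkage of the LS estimator toward zero under isolated AOs.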

Note that the maximum bias in (8.13) is $|\rho_v(1) - \phi|$, which depends upon the
value of $\phi$, and this feature holds in general for ARMA models.


8.11.3 Breakdown point


Extending the notion of BP given in Section 3.2 to the time series setting presents 
some difficulties. 

The first is how "contamination" is defined. One could simply consider the finite
BP for observations $y_1, \ldots, y_T$, as defined in Section 3.2.5, and then define the
asymptotic BP by letting $T \to \infty$. The drawback of this approach is that it is intractable
except in very simple cases. We are thus led to consider contamination by a process 
such as AO or RO, with the consequence that the results will depend on the type of 
contaminating process considered. 

The second is how “breakdown” is defined. This difficulty is due to the fact that 
in time series models the parameter space is generally bounded, and moreover the 
effect of outliers is more complicated than with location, regression or scale. 

This feature can be seen more easily in the AR(1) case. It was seen in Section 8.1.3
that the effect on the LS estimator of contaminating a process $x_t$ with an AO process
$v_t$ is that the estimator may take on any value between the lag-1 autocorrelations
of $x_t$ and $v_t$. If $v_t$ is arbitrary, then the asymptotic value of the estimator may be
arbitrarily close to the boundary $\{-1, 1\}$ of the parameter space, and thus there would
be breakdown according to the definitions of Section 3.2.

However, in some situations it is considered more reasonable to take only isolated 
(that is, i.i.d.) AOs into account. In this case the worst effect of the contamination is 
to shrink the estimator toward zero, and this could be considered as breakdown if the 
true parameter is not null. One could define breakdown as the estimator approaching 
$\pm 1$ or 0, but it would be unsatisfactory to tailor the definition in an ad-hoc manner to
each estimator and type of contamination. 

A completely general definition that takes these problems into account was given 
by Genton and Lucas (2003). The intuitive idea is that breakdown occurs for some 
contamination rate $\varepsilon$ if further increasing the contamination rate does not further
enlarge the range of values taken on by the estimator over the contamination neigh- 
borhood. In particular, for the case of AR(1) with independent AOs, it follows from 
the definition that breakdown occurs if the estimator can be taken to zero. 

The details of the definition are very elaborate and are therefore omitted here. 


8.11.4 Maximum bias curves for the AR(1) model 


Here we present maximum bias curves from Martin and Yohai (1991) for three
estimators of $\phi$ for a centered AR(1) model with RO:

$$y_t = x_t(1 - z_t) + z_t w_t, \qquad x_t = \phi x_{t-1} + u_t,$$

where $z_t$ are i.i.d. with

$$P(z_t = 1) = \varepsilon, \qquad P(w_t = c) = P(w_t = -c) = 0.5.$$

The three considered estimators are


• the estimator obtained by modelling the outliers found using the procedure
  described in Section 8.10.1 (Chang et al., 1988);

• the median of slopes estimator $\mathrm{Med}(y_t / y_{t-1})$, which, as mentioned in
  Section 5.11.2, has bias-optimality properties;

• a filtered M-scale robust estimator, which is the same as the F$\tau$-estimator except
  that an M-scale was used by Martin and Yohai instead of the $\tau$-scale, which is the
  approach recommended in this book.


The curves were computed by a Monte Carlo procedure. Let $\hat\phi_T(\varepsilon, c)$ be the value
of any of the three estimators for sample size $T$. For sufficiently large $T$, the value
of $\hat\phi_T(\varepsilon, c)$ will be negligibly different from its asymptotic value
$\hat\phi_\infty(\varepsilon, c)$; $T = 2000$ was used for the purpose of this approximation. Then the
maximum asymptotic bias was approximated as

$$B(\varepsilon) = \sup_c \left| \hat\phi_T(\varepsilon, c) - \phi \right| \tag{8.135}$$


by search on a grid of $c$ values from 0 to 6 with a step size of 0.02. We plot the
signed value of $B(\varepsilon)$ in Figure 8.14 for the case $\phi = 0.9$. The results clearly show
the superiority of the robust filtered M-scale estimator, which has relatively small
bias over the entire range of $\varepsilon$ from 0 to 0.4, with estimator breakdown (not shown)
occurring at about $\varepsilon = 0.45$. Similar results would be expected for the F$\tau$-estimator.
The estimator obtained using the classical outlier detection procedure of Chang et al.
(1988) has quite poor global performance: while its maximum bias behaves similarly
to that of the robust filtered M-scale estimator for small $\varepsilon$, the estimator breaks down
in the presence of white-noise contamination, with a bias of essentially $-0.9$ for $\varepsilon$
a little less than 0.1. The GM-estimator has a maximum bias behavior in between
that of the other two estimators, with rapidly increasing maximum bias as $\varepsilon$ increases
beyond roughly 0.1, but is not quite broken down at $\varepsilon = 0.35$. However, one can
conjecture from the maximum bias curve that breakdown to zero occurs by $\varepsilon = 0.35$.
We note that other types of bounded influence GM-estimators that use redescending
functions can apparently achieve better maximum bias behavior than this particular
GM-estimator (see Martin and Jong, 1977).

Figure 8.14 Maximum bias curves (“BIF” indicates the GM-estimator)
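
The Monte Carlo search behind (8.135) can be sketched as follows. This is only an
illustration under stated assumptions, not the code used by Martin and Yohai: the
innovations are taken as standard normal, the function names are hypothetical, the
LS estimator is included purely for comparison, and a coarse grid of $c$ values
replaces the 0.02 step to keep the run short.

```python
import random
import statistics

def simulate_ro_ar1(T, phi, eps, c, rng):
    """Replacement-outlier AR(1): y_t = x_t (1 - z_t) + z_t w_t."""
    # Start x from its stationary N(0, 1/(1 - phi^2)) law (unit innovation variance).
    x = rng.gauss(0.0, (1.0 - phi ** 2) ** -0.5)
    y = []
    for _ in range(T):
        x = phi * x + rng.gauss(0.0, 1.0)
        if rng.random() < eps:                       # z_t = 1: replace by w_t = +/- c
            y.append(c if rng.random() < 0.5 else -c)
        else:                                        # z_t = 0: observe x_t
            y.append(x)
    return y

def ls_ar1(y):
    """Least-squares estimator of phi for a centered AR(1)."""
    num = sum(a * b for a, b in zip(y[1:], y[:-1]))
    den = sum(b * b for b in y[:-1])
    return num / den

def median_of_slopes(y):
    """Med(y_t / y_{t-1}), skipping zero denominators."""
    return statistics.median(a / b for a, b in zip(y[1:], y[:-1]) if b != 0.0)

def signed_max_bias(estimator, phi, eps, c_grid, T=2000, seed=0):
    """Signed value of the largest |estimate - phi| found over the grid of c."""
    rng = random.Random(seed)
    worst = 0.0
    for c in c_grid:
        bias = estimator(simulate_ro_ar1(T, phi, eps, c, rng)) - phi
        if abs(bias) > abs(worst):
            worst = bias
    return worst

coarse_grid = [0.5 * k for k in range(13)]           # 0, 0.5, ..., 6
bias_ls = signed_max_bias(ls_ar1, 0.9, 0.1, coarse_grid)
bias_med = signed_max_bias(median_of_slopes, 0.9, 0.1, coarse_grid)
```

Repeating the call over a grid of $\varepsilon$ values traces out an approximate signed
maximum bias curve for each estimator.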


8.12 Other approaches for ARMA models 


8.12.1 Estimators based on robust autocovariances 


The class of robust estimators based on robust autocovariances (RA estimators) was 
proposed by Bustos and Yohai (1986). These estimators are based on a convenient 
robustification of the estimating LS equations. 

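Let $\lambda = (\phi, \theta, \mu)$. As a particular case of the results to be proved in Section 8.15,
the equations for the LS estimator can be reexpressed as

$$\sum_{t=p+i+1}^{T} \hat u_t(\lambda) \sum_{j=0}^{t-p-i-1} \pi_j(\phi)\, \hat u_{t-j-i}(\lambda) = 0, \quad i = 1, \ldots, p,$$

$$\sum_{t=p+i+1}^{T} \hat u_t(\lambda) \sum_{j=0}^{t-p-i-1} \zeta_j(\theta)\, \hat u_{t-j-i}(\lambda) = 0, \quad i = 1, \ldots, q,$$

and

$$\sum_{t=p+1}^{T} \hat u_t(\lambda) = 0, \tag{8.136}$$

where $\hat u_t(\lambda)$ is defined in (8.58), and $\pi_j$ and $\zeta_j$ are the coefficients of the inverses of
the AR and MA polynomials; that is,

$$\phi^{-1}(B) = \sum_{j=0}^{\infty} \pi_j(\phi) B^j$$

and

$$\theta^{-1}(B) = \sum_{j=0}^{\infty} \zeta_j(\theta) B^j.$$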
This system of equations can be written as

$$\sum_{j=0}^{T-p-i-1} \pi_j(\phi)\, M_{i+j}(\lambda) = 0, \quad i = 1, \ldots, p, \tag{8.137}$$

$$\sum_{j=0}^{T-p-i-1} \zeta_j(\theta)\, M_{i+j}(\lambda) = 0, \quad i = 1, \ldots, q, \tag{8.138}$$

and (8.136), where

$$M_j(\lambda) = \sum_{t=p+j+1}^{T} \hat u_t(\lambda)\, \hat u_{t-j}(\lambda). \tag{8.139}$$

The RA estimators are obtained by replacing the term $M_j(\lambda)$ in (8.137) and
(8.138) with

$$M_j^*(\lambda) = \sum_{t=p+j+1}^{T} \eta\big(\hat u_t(\lambda), \hat u_{t-j}(\lambda)\big), \tag{8.140}$$

and (8.136) with

$$\sum_{t=p+1}^{T} \psi\big(\hat u_t(\lambda)\big) = 0,$$

where $\psi$ is a bounded $\psi$-function.

The name of this family of estimators comes from the fact that $M_j/(T - j - p)$
is an estimator of the autocovariance of the residuals $\hat u_t(\lambda)$, and $M_j^*/(T - j - p)$ is a
robust version thereof.

Two types of $\eta$-functions are considered by Bustos and Yohai (1986):
Mallows-type functions of the form $\eta(u, v) = \psi^*(u)\psi^*(v)$ and Schweppe-type
functions of the form $\eta(u, v) = \psi^*(uv)$, where $\psi^*$ is a bounded $\psi$-function. The
functions $\psi$ and $\psi^*$ can be taken, for example, in the Huber or bisquare families.
These estimators have good robustness properties for AR($p$) models with small $p$.
However, the fact that they use regular residuals makes them vulnerable to outliers
when $p$ is large or $q > 0$. They are consistent and asymptotically normal. A heuristic
proof is given in Section 8.15. The asymptotic covariance matrix is of the form
$b(\psi, F)\mathbf{V}_{LS}$, where $b(\psi, F)$ is a scalar term (Bustos and Yohai, 1986).
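
To make the difference between $M_j$ in (8.139) and $M_j^*$ in (8.140) concrete, the
following minimal sketch evaluates both on a short residual sequence containing one
gross value. It assumes the residuals have already been computed and standardized,
takes $\psi^*$ to be a Huber function with the hypothetical tuning constant 1.345, and
uses illustrative function names throughout.

```python
def huber_psi(u, k=1.345):
    """Huber psi-function: identity on [-k, k], clipped outside."""
    return max(-k, min(k, u))

def autocov_sum(res, j):
    """M_j: sum of lagged products of residuals, as in (8.139)."""
    return sum(res[t] * res[t - j] for t in range(j, len(res)))

def robust_autocov_sum(res, j, eta):
    """M_j^*: the same sum with products replaced by eta(u, v), as in (8.140)."""
    return sum(eta(res[t], res[t - j]) for t in range(j, len(res)))

def eta_mallows(u, v):
    """Mallows type: eta(u, v) = psi*(u) psi*(v)."""
    return huber_psi(u) * huber_psi(v)

def eta_schweppe(u, v):
    """Schweppe type: eta(u, v) = psi*(u v)."""
    return huber_psi(u * v)

residuals = [1.0, -1.0, 1.0, -1.0, 1000.0]        # last value is a gross outlier
m1_ls = autocov_sum(residuals, 1)                 # dominated by the outlier
m1_mallows = robust_autocov_sum(residuals, 1, eta_mallows)
m1_schweppe = robust_autocov_sum(residuals, 1, eta_schweppe)
```

The outlier contributes $-1000$ to the classical sum but at most 1.345 in absolute
value to either robust version.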


8.12.2 Estimators based on memory-m prediction residuals 


Suppose that we want to fit an ARMA($p$, $q$) model using the series $y_t$, $1 \le t \le T$. It
is possible to define M-estimators, GM-estimators and estimators based on the
minimization of a residual scale using residuals based on a memory-$m$ predictor, where
$m \ge p + q$. This last condition is needed to ensure that the estimators are well defined.

Consider the memory-$m$ best linear predictor when the true parameter is
$\lambda = (\phi, \theta, \mu)$:

$$\hat y_{t,m}(\lambda) = \mu + \varphi_{m,1}(\phi, \theta)(y_{t-1} - \mu) + \cdots + \varphi_{m,m}(\phi, \theta)(y_{t-m} - \mu),$$

where $\varphi_{m,i}$ are the coefficients of the predictor defined in (8.31) (here we call
them $\varphi_{m,i}$ rather than $\phi_{m,i}$ to avoid confusion with the parameter vector $\phi$).
Masarotto (1987) proposed estimating the parameters using memory-$m$ residuals
defined by

$$\hat u_{t,m}(\lambda) = y_t - \hat y_{t,m}(\lambda), \quad t = m + 1, \ldots, T.$$
Masarotto proposed this approach for GM-estimators, but actually it can be used
with any robust estimator, including MM-estimators or estimators based on the
minimization of a robust residual scale

$$\hat\sigma\big(\hat u_{m+1,m}(\lambda), \ldots, \hat u_{T,m}(\lambda)\big),$$

where $\hat\sigma$ is an M- or $\tau$-scale. Since one outlier spoils $m + 1$ memory-$m$ residuals, the
robustness of these procedures depends on how large the value of $m$ is.
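
The remark that one outlier spoils $m + 1$ memory-$m$ residuals can be checked with a
small sketch. The predictor coefficients below are hypothetical numbers chosen only
for illustration (they are not derived from (8.31)), and the function name is an
assumption.

```python
def memory_m_residuals(y, coeffs, mu=0.0):
    """Residuals y_t - yhat_{t,m} for a memory-m linear predictor.

    coeffs[i] multiplies (y_{t-1-i} - mu), so m = len(coeffs).
    """
    m = len(coeffs)
    res = []
    for t in range(m, len(y)):
        pred = mu + sum(coeffs[i] * (y[t - 1 - i] - mu) for i in range(m))
        res.append(y[t] - pred)
    return res

coeffs = [0.5, -0.3, 0.2]            # hypothetical memory-3 predictor coefficients
clean = [0.0] * 30
contaminated = list(clean)
contaminated[15] = 10.0              # a single additive outlier

res_clean = memory_m_residuals(clean, coeffs)
res_cont = memory_m_residuals(contaminated, coeffs)
n_spoiled = sum(1 for a, b in zip(res_clean, res_cont) if a != b)
```

With $m = 3$ nonzero coefficients the outlier enters the residual at its own time point
and the three that follow, so `n_spoiled` equals $m + 1 = 4$.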

For AR($p$) models, the memory-$p$ residuals are the regular residuals $\hat u_t(\lambda)$ given
in (8.20), and therefore no new estimators are defined here.

One shortcoming of the estimators based on memory-m residuals is that the 
convergence to the true values holds only under the assumption that the process y, 
is Gaussian. 


8.13 High-efficiency robust location estimators 


In Section 8.2 we described the AR($p$) model in the two equivalent forms (8.14)
and (8.18). We have been somewhat cavalier about which of these two forms to use
in fitting the model, implicitly thinking that the location parameter $\mu$ is a nuisance
parameter that is unimportant. That being the case, there is a temptation to use a
simple robust location estimator $\hat\mu$ for the centering, say for $\hat\mu$ an ordinary location
M-estimator, as described in Section 2.3. However, the location parameter may be of
interest for its own sake, and there may be disadvantages in using an ordinary location
M-estimator for the centering approach to fitting an AR model.
Use of the relationship (8.17) leads naturally to the location estimator

$$\hat\mu = \frac{\hat\gamma}{1 - \sum_{i=1}^{p} \hat\phi_i}. \tag{8.141}$$

It is easy to check that the same form of location estimator is obtained for
an ARMA($p$, $q$) model in intercept form. In the context of M-estimators or
GM-estimators of AR and ARMA models we call the estimator (8.141) a proper
location M- (or GM-) estimator.

It turns out that use of an ordinary location M-estimator has two problems when
applied to ARMA models. The first is that selection of the tuning constant to achieve
a desired high efficiency when the innovations are normally distributed depends upon
the model parameters, which are not known in advance. This problem is most severe
for ARMA($p$, $q$) models with $q > 0$. The second problem is that the efficiency of
the ordinary M-estimator can be exceedingly low relative to the proper M-estimator.
Details are provided by Lee and Martin (1986), who show that

• For an AR(1) model, the efficiency of the ordinary M-estimator relative to the
  proper M-estimator is between 10% and 20% for $\phi = \pm 0.9$ and approximately
  60% for $\phi = \pm 0.5$.

• For an MA(1) model, the relative efficiency is above approximately 80% for
  positive $\theta$ but is around 50% for $\theta = -0.5$ and is arbitrarily low as $\theta$
  approaches $-1$. The latter was shown by Grenander (1981) to be a point of
  superefficiency.


The conclusion is that one should not use the ordinary location M-estimator 
for AR and ARMA processes when one is interested in location for its own sake. 
Furthermore, the severe loss of efficiency of the ordinary location M-estimator 
that is obtained for some parameter values raises doubts about its use for centering 
purposes, even when one is not interested in location for its own sake. It seems 
from the evidence at hand that it is prudent to fit the intercept form of AR and 
ARMA models, and when the location estimator is needed it can be computed from 
expression (8.141). 


8.14 Robust spectral density estimation 


8.14.1 Definition of the spectral density 


Any second-order stationary process $y_t$ defined for integer $t$ has a spectral
representation

$$y_t = \int_{-1/2}^{1/2} \exp(i 2\pi t f)\, dZ(f) \tag{8.142}$$

where $Z(f)$ is a complex orthogonal increments process on $(-1/2, 1/2]$; that is, for
any $f_1 < f_2 < f_3 < f_4$,

$$E\left\{ \big(Z(f_2) - Z(f_1)\big)\, \overline{\big(Z(f_4) - Z(f_3)\big)} \right\} = 0,$$

where $\bar z$ denotes the conjugate of the complex number $z$. See, for example, Brockwell
and Davis (1991). This result says that any stationary time series can be interpreted
as the limit of a sum of sinusoids $A_i \cos(2\pi f_i t + \Phi_i)$, with random amplitudes $A_i$ and
random phases $\Phi_i$. The process $Z(f)$ defines an increasing function $G(f) = E|Z(f)|^2$,
with $G(-1/2) = 0$ and $G(1/2) = \sigma^2 = \mathrm{Var}(y_t)$. The function $G(f)$ is called the
spectral distribution function, and when its derivative $S(f) = G'(f)$ exists it is called the
spectral density function of $y_t$. Other commonly used terms for $S(f)$ are power
spectral density, spectrum and power spectrum. We assume for purposes of this discussion
that $S(f)$ exists, which implies that $y_t$ has been centered by subtracting its mean. The
more general case of a discrete time process on time intervals of length $\Delta$ is easily
handled with slight modifications to the above (see, for example, Bloomfield, 1976
and Percival and Walden, 1993).

Using the orthogonal increments property of $Z(f)$, it is immediately found that
the lag-$k$ covariances of $y_t$ are given by

$$C(k) = \int_{-1/2}^{1/2} \exp(i 2\pi k f)\, S(f)\, df. \tag{8.143}$$

Therefore the $C(k)$ are the Fourier coefficients of $S(f)$ and so we have the Fourier
series representation

$$S(f) = \sum_{k=-\infty}^{\infty} C(k) \exp(-i 2\pi f k). \tag{8.144}$$


8.14.2 AR spectral density 


It is easy to show that for a zero-mean AR($p$) process with parameters $\phi_1, \ldots, \phi_p$ and
innovations variance $\sigma_u^2$ the spectral density is given by

$$S_{AR,p}(f) = \frac{\sigma_u^2}{|H(f)|^2} \tag{8.145}$$

where

$$H(f) = 1 - \sum_{k=1}^{p} \phi_k \exp(i 2\pi f k). \tag{8.146}$$

The importance of this result is that any continuous and nonzero spectral density
$S(f)$ can be approximated arbitrarily closely and uniformly in $f$ by an AR($p$) spectral
density $S_{AR,p}(f)$ for sufficiently large $p$ (Grenander and Rosenblatt, 1957).
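
As a quick numerical illustration of (8.145) and (8.146), the sketch below evaluates the
AR spectral density and checks that it integrates over $(-1/2, 1/2]$ to the process
variance, as (8.143) with $k = 0$ requires. The function name and the AR(1) test values
are assumptions chosen for the example.

```python
import cmath
import math

def ar_spectral_density(phi, sigma2_u, f):
    """AR(p) spectral density sigma_u^2 / |H(f)|^2 with H(f) = 1 - sum phi_k e^{i 2 pi f k}."""
    h = 1.0 - sum(p * cmath.exp(2j * math.pi * f * (k + 1))
                  for k, p in enumerate(phi))
    return sigma2_u / abs(h) ** 2

# AR(1) with phi = 0.5 and unit innovations variance.
s_at_zero = ar_spectral_density([0.5], 1.0, 0.0)      # 1 / (1 - 0.5)^2
s_at_half = ar_spectral_density([0.5], 1.0, 0.5)      # 1 / (1 + 0.5)^2

# Midpoint-rule integral over (-1/2, 1/2): should equal Var(y_t) = 1 / (1 - 0.25).
n = 2000
variance = sum(ar_spectral_density([0.5], 1.0, -0.5 + (i + 0.5) / n)
               for i in range(n)) / n
```

For positive $\phi_1$ the power is concentrated at low frequencies, which is the kind of
wide dynamic range that makes leakage and outliers troublesome in the later sections.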


8.14.3 Classic spectral density estimation methods 


The classic, most frequently used way to estimate spectral density is a nonparametric
method based on smoothing the periodogram. The steps are as follows. Let $y_t$,
$t = 1, \ldots, T$, be the observed data, let $d_t$, $t = 1, \ldots, T$, be a data taper that goes
smoothly to zero at both ends, and form the modified data $\tilde y_t = d_t y_t$. Then use the fast
Fourier transform (FFT; Bloomfield, 1976) to compute the discrete Fourier transform

$$X(f_k) = \sum_{t=1}^{T} \tilde y_t \exp(-i 2\pi f_k t) \tag{8.147}$$

where $f_k = k/T$ for $k = 0, 1, \ldots, [T/2]$. Use the result to form the periodogram:

$$\hat S(f_k) = \frac{1}{T} |X(f_k)|^2. \tag{8.148}$$

It is known that the periodogram is an approximately unbiased estimator of $S(f)$ for
large $T$, but it is not a consistent estimator. For this reason, $\hat S(f_k)$ is smoothed in the
frequency domain to obtain an improved estimator of reduced variability, namely

$$\bar S(f_k) = \sum_{m=-M}^{M} w_m \hat S(f_{k+m}), \tag{8.149}$$

where the smoothing weights $w_m$ are symmetric, with $w_m = w_{-m}$.
The purpose of the data taper is to reduce the so-called leakage effect of implicit
truncation of the data with a rectangular window; originally, data tapers such as a 
cosine window or Parzen window were used. For details on this and other aspects of 
spectral density estimation, see Bloomfield (1976). A much preferred method is to 
use a prolate spheroidal taper, whose application in spectral analysis was pioneered 
by Thomson (1977). See also Percival and Walden (1993). 
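
The steps of the classic method (taper, discrete Fourier transform, periodogram,
frequency smoothing) can be sketched in a few lines. This is a plain $O(T^2)$ transform
rather than an FFT, the split cosine bell is just one simple choice of taper, and the
treatment of the band edges (renormalizing the weights that fall inside the frequency
range) is an assumption made for the example; all names are illustrative.

```python
import cmath
import math

def cosine_taper(T, frac=0.1):
    """Split cosine bell: ramps up and down over a fraction `frac` at each end."""
    n_ramp = max(1, int(frac * T))
    d = [1.0] * T
    for i in range(n_ramp):
        w = 0.5 * (1.0 - math.cos(math.pi * (i + 0.5) / n_ramp))
        d[i] = w
        d[T - 1 - i] = w
    return d

def periodogram(y, taper=None):
    """|X(f_k)|^2 / T at f_k = k/T, k = 0, ..., [T/2], for the tapered data."""
    T = len(y)
    d = taper if taper is not None else [1.0] * T
    tapered = [di * yi for di, yi in zip(d, y)]
    out = []
    for k in range(T // 2 + 1):
        X = sum(v * cmath.exp(-2j * math.pi * (k / T) * (t + 1))
                for t, v in enumerate(tapered))
        out.append(abs(X) ** 2 / T)
    return out

def smooth(S, weights):
    """Symmetric smoothing; weights = [w_0, w_1, ..., w_M] with w_{-m} = w_m."""
    M = len(weights) - 1
    out = []
    for k in range(len(S)):
        num = den = 0.0
        for m in range(-M, M + 1):
            if 0 <= k + m < len(S):          # renormalize near the band edges
                num += weights[abs(m)] * S[k + m]
                den += weights[abs(m)]
        out.append(num / den)
    return out

T = 64
y = [math.cos(2.0 * math.pi * 8 * t / T) for t in range(1, T + 1)]
raw = periodogram(y)                          # untapered: sharp peak at k = 8
smoothed = smooth(raw, [0.5, 0.25])
tapered = periodogram(y, cosine_taper(T))
```

For this pure sinusoid at a Fourier frequency the untapered periodogram has a single
peak of height $T/4$, and smoothing spreads it over the neighboring frequencies.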

Given the result in Section 8.14.2 one can also use a parametric AR($p$) approximation
approach to estimating the spectral density based on parameter estimators
$\hat\phi_1, \ldots, \hat\phi_{\hat p}$ and $\hat\sigma_u^2$; here $\hat p$ is an estimator of the order $p$, obtained through a
selection criterion such as AIC, BIC or FPE, which are discussed in Brockwell and Davis
(1991). In this case we compute

$$\hat S_{AR,\hat p}(f) = \frac{\hat\sigma_u^2}{\left| 1 - \sum_{k=1}^{\hat p} \hat\phi_k \exp(i 2\pi f k) \right|^2} \tag{8.150}$$

on a grid of frequency values $f = f_k$.


8.14.4 Prewhitening 


Prewhitening is a filtering technique introduced by Blackman and Tukey (1958), in 
order to transform a time series into one whose spectrum is nearly flat. One then 
estimates the spectral density of the prewhitened series, with a greatly reduced impact 
of leakage bias, and then transforms the prewhitened spectral density back, using the 
frequency domain equivalent of inverse filtering, in order to obtain an estimator of 
the spectrum for the original series. Tukey (1967) says: 


If low frequencies are $10^3$, $10^4$, or $10^5$ times as active as high ones, a not infrequent
phenomenon in physical situations, even a fairly good window is too leaky for comfort.
The cure is not to go in for fancier windows, but rather to preprocess the data toward
a flatter spectrum, to analyze this prewhitened series, and then to adjust its estimated
spectrum for the easily computable effects of preprocessing.


The classic (nonrobust) way to accomplish the overall estimation method is to use
the following modified form of the AR spectrum estimator (8.150):

$$\hat S_{AR,\hat p,\hat u}(f) = \frac{\bar S_{\hat u,\hat p}(f)}{\left| 1 - \sum_{k=1}^{\hat p} \hat\phi_k \exp(i 2\pi f k) \right|^2} \tag{8.151}$$

where $\bar S_{\hat u,\hat p}(f)$ is a smoothed periodogram estimator as described above, but applied
to the fitted AR residuals $\hat u_t = y_t - \hat\phi_1 y_{t-1} - \cdots - \hat\phi_{\hat p} y_{t-\hat p}$. The estimator
$\hat S_{AR,\hat p,\hat u}(f)$ provides substantial improvement on the simpler estimator $\hat S_{AR,\hat p}(f)$ in
(8.150) by replacing the numerator estimator $\hat\sigma_u^2$ that is fixed, independent of frequency,
with the frequency-varying estimator $\bar S_{\hat u,\hat p}(f)$. The order estimator $\hat p$ may be obtained
with an AIC or BIC order selection method (the latter is known to be preferable). Experience
indicates that use of moderately small fixed orders $p$, in the range from two to six, will
often suffice for effective prewhitening, suggesting that automatic order selection will
often result in values of $\hat p$ in a similar range.


8.14.5 Influence of outliers on spectral density estimators 


Suppose the AO model $y_t = x_t + v_t$ contains a single additive outlier $v_{t_0}$ of size $A$.
Then the periodogram $\hat S_y(f_k)$ based on the observations $y_t$ will have the form

$$\hat S_y(f_k) = \hat S_x(f_k) + \frac{A^2}{T} + \frac{2A}{T} \mathrm{Re}\big[ X(f_k) \exp(i 2\pi f_k t_0) \big] \tag{8.152}$$

where $\hat S_x(f_k)$ is the periodogram based on the outlier-free series $x_t$ and Re denotes the
real part. Thus the outlier causes the estimator to be raised by the constant amount
$A^2/T$ at all frequencies, plus the amount of the oscillatory term

$$\frac{2A}{T} \mathrm{Re}\big[ X(f_k) \exp(i 2\pi f_k t_0) \big]$$

that varies with frequency. If the spectrum amplitude varies over a wide range with
frequency, the effect of the outlier can be to obscure small but important peaks
(corresponding to small-amplitude oscillations in the $x_t$ series) in low-amplitude regions of
the spectrum. It can be shown that a pair of outliers can generate an oscillation whose
frequency is determined by the time separation of the outliers, and whose impact can
also obscure features in the low-amplitude region of the spectrum (Problem 8.13).
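
The decomposition (8.152) is an exact algebraic identity for the untapered
periodogram, which the following sketch checks numerically on simulated data. The
series length, outlier size and position are arbitrary illustrative values, $X(f_k)$ is
taken to be the transform of the outlier-free series, and no taper is applied.

```python
import cmath
import math
import random

def dft(x, k):
    """X(f_k) = sum_{t=1}^{T} x_t exp(-i 2 pi f_k t) with f_k = k / T."""
    T = len(x)
    return sum(v * cmath.exp(-2j * math.pi * (k / T) * (t + 1))
               for t, v in enumerate(x))

rng = random.Random(1)
T, A, t0 = 50, 8.0, 20                      # t0 is the (1-based) outlier time
x = [rng.gauss(0.0, 1.0) for _ in range(T)]
y = list(x)
y[t0 - 1] += A                              # single additive outlier of size A

max_gap = 0.0
for k in range(T // 2 + 1):
    fk = k / T
    Xk = dft(x, k)
    lhs = abs(dft(y, k)) ** 2 / T           # periodogram of the contaminated series
    rhs = (abs(Xk) ** 2 / T                 # periodogram of the clean series
           + A ** 2 / T                     # constant shift
           + (2.0 * A / T) * (Xk * cmath.exp(2j * math.pi * fk * t0)).real)
    max_gap = max(max_gap, abs(lhs - rhs))
```

The two sides agree to rounding error at every frequency, so `max_gap` is numerically
zero.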

To get an idea of the impact of AOs more generally, we focus on the mean and
variance of the smoothed periodogram estimators $\bar S(f_k)$ under the assumption that $x_t$
and $v_t$ are independent, and that the conditions of consistency and asymptotic
normality of $\bar S(f_k)$ hold. Then for moderately large sample sizes, the mean and variance
of $\bar S(f_k)$ are given approximately by

$$E\bar S(f_k) = S_y(f_k) = S_x(f_k) + S_v(f_k) \tag{8.153}$$

and

$$\mathrm{Var}(\bar S(f_k)) = S_y(f_k)^2 = S_x(f_k)^2 + S_v(f_k)^2 + 2 S_x(f_k) S_v(f_k). \tag{8.154}$$

Thus AOs cause both bias and inflated variability of the smoothed periodogram
estimator. If $v_t$ is i.i.d. with variance $\sigma_v^2$, the bias is just $\sigma_v^2$ and the variance is inflated by
the amount $\sigma_v^4 + 2 S_x(f_k)\sigma_v^2$.

Striking examples of the influence that outliers can have on spectral density
estimators were given by Kleiner et al. (1979) and Martin and Thomson (1982). The most
dramatic and compelling of these examples is in the former paper, where the data
consist of 1000 measurements of diameter distortions along a section of an advanced
wave-guide designed to carry over 200,000 simultaneous telephone conversations. In
this case, the data are a “space” series but can be treated in the same manner as a time
series as far as spectrum analysis is concerned. Two relatively minor outliers due to a
malfunctioning of the recording instrument, and not noticeable in simple plots of the
data, obscure important features of a spectrum having a very wide dynamic range (in
this case the ratio of the prediction variance to the process variance of an AR(7) fit
is approximately 10~°!). Figure 8.15 (from Kleiner et al., 1979) shows the diameter
distortion measurements as a function of distance along the wave-guide, and points
out that the two outliers are noticeable only in a considerably amplified local section
of the data. Figure 8.16 shows the differenced series (a “poor man’s prewhitening”),
which clearly reveals the location of the two outliers as doublets; Figure 8.17 shows
the classic periodogram-based estimator (dashed line) with the oscillatory artifact
caused by the outliers, along with a robust estimator (solid line) that we describe
next. Note in the latter figure that the classic estimator has an eight-decade dynamic
range, while the robust estimator has a substantially increased dynamic range of close
to eleven decades, and reveals features that have known physical interpretations that
are totally obscured in the classical estimator; see Kleiner et al. (1979) for details.

Figure 8.15 Wave-guide data: diameter distortion measurements against distance

Figure 8.16 Wave-guide data: differenced series

Figure 8.17 Wave-guide data: classical (dashed line) and robust (solid line) spectra


8.14.6 Robust spectral density estimation 


Our recommendation is to compute robust spectral density estimators by robustifying
the prewhitened spectral density (8.151) as follows. The AR parameter estimators
$\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_{\hat p}$ and $\hat\sigma_u^2$ are computed using the F$\tau$-estimator, and $\hat p$ is computed
using the robust order selection method of Section 8.6.6. Then, to compute a robust
smoothed spectral density estimator $\bar S^*_{\hat u,\hat p}(f)$, the nonrobust residual estimators

$$\hat u_t = y_t - \hat\phi_1 y_{t-1} - \cdots - \hat\phi_{\hat p}\, y_{t-\hat p}$$

are replaced by the robust residual estimators, defined as

$$\hat u_t^* = \hat x_{t|t} - \hat\phi_1 \hat x_{t-1|t-1} - \hat\phi_2 \hat x_{t-2|t-1} - \cdots - \hat\phi_{\hat p}\, \hat x_{t-\hat p|t-1},$$

where $\hat x_{t-i|t-1}$, $i = 0, 1, \ldots, \hat p$, are obtained from the robust filter. Note that these
robust prediction residuals differ from the robust prediction residuals $\hat u_t$ (8.76)
in Section 8.6.2 in that the latter have $y_t - \hat\mu$ where we have $\hat x_{t|t}$. We make this
replacement because we do not want outliers to influence the smoothed periodogram
estimator based on the robust residuals. Also, we do not bother with an estimator
of $\mu$ because, as mentioned at the beginning of the section, one always works with
de-meaned series in spectral analysis.

Note that our approach in this chapter — of using robust filtering — results in 
replacing outliers with one-sided predictions based on previous data. It is quite 
natural to think about improving this approach by using a robust smoother, as 
mentioned at the end of Section 8.6.2. See Martin and Thomson (1982) for the 
algorithm and its application to spectral density estimation. The authors show, using 
the wave-guide data, that it can be unsafe to use the robust filter algorithm if the 
AR order is not sufficiently large or the tuning parameters are changed somewhat, 
while the robust smoother algorithm results in a more reliable outlier interpola- 
tion and associated spectral density estimator; see Martin and Thomson (1982, 
Figs 24-27). 

Kleiner et al. (1979) also show good results for some examples using a pure robust
AR spectral density estimator; that is, the robust smoothed spectral density estimator
$\bar S^*_{\hat u,\hat p}(f)$ is replaced with a robust residuals variance estimator $\hat\sigma_u^2$ and a sufficiently
high AR order is used. Our feeling is that this approach is only suitable for spectrum
analysis contexts where the user is confident that the dynamic range of the spectrum
is not very large, at most two or three decades.

The reader interested in robust spectral density estimation can find more details 
and several examples in Kleiner et al. (1979) and Martin and Thomson (1982). Martin 
and Thomson (1982, Section II) point out that small outliers may not only obscure 
the lower part of the spectrum but also may inflate innovation variance estimators by 
orders of magnitude. 


8.14.7 Robust time-average spectral density estimator 


The classic approach to spectral density estimation described in Section 8.14.3
reduces the variability of the periodogram by averaging periodogram values in
the frequency domain, as indicated in (8.149). In some applications with large
amounts of data, it may be advantageous to reduce the variability by averaging
the periodogram in the time domain, as originally described by Welch (1967). The
idea is to break the time series data up into $M$ equal-length contiguous segments of
length $N$, compute the periodogram $\hat S_m(f_k) = \frac{1}{N}|X_m(f_k)|^2$ at each frequency $f_k = k/N$
on the $m$th segment, and at each $f_k$ form the smoothed periodogram estimator

$$\bar S(f_k) = \frac{1}{M} \sum_{m=1}^{M} \hat S_m(f_k). \tag{8.155}$$

The problem with this estimator is that even a single outlier in the $m$th segment can
spoil the estimator $\hat S_m(f_k)$, as discussed previously. One way to robustify this estimator
is to replace the sample mean in (8.155) with an appropriate robust estimator. One
should not use a location M-estimator that assumes a symmetric nominal distribution
for the following reason. Under normality, the periodogram may be represented by
the approximation

$$\hat S_m(f_k) \approx \frac{s_k}{2}\, Y \tag{8.156}$$

where $Y$ is a chi-squared random variable with two degrees of freedom and $s_k =
E\hat S_m(f_k) \approx S(f_k)$ for large $T$. Thus estimation of $S(f_k)$ is equivalent to estimating the
scale of an exponential distribution.

Under AO- or RO-type outlier contamination, a reasonable approximate model
for the distribution of the periodogram $\hat S_m(f_k)$ is the contaminated exponential
distribution

$$(1 - \varepsilon)\,\mathrm{Ex}(s_k) + \varepsilon\,\mathrm{Ex}(s_{c,k}), \tag{8.157}$$

where $\mathrm{Ex}(\alpha)$ is the exponential distribution with mean $\alpha$. Here outliers may result in
$s_{c,k} > s_k$, at least at some frequencies $f_k$. Thus the problem is to find a good robust
estimator of $s_k$ in the contaminated exponential model (8.157). It must be kept in
mind that the overall data series can have a quite small fraction of contamination and
still influence many of the segment estimators $\hat S_m(f_k)$, and hence a high BP estimator
of $s_k$ is desirable. Consider a more general model of the form (8.157), in which the
contaminating distribution $\mathrm{Ex}(s_{c,k})$ is replaced with the distribution of any positive
random variable. As mentioned in Section 5.2.2, the min–max bias estimator of scale
for this case is very well approximated by a scaled median (Martin and Zamar, 1989)
with scaling constant $(0.693)^{-1}$ for Fisher-consistency for the nominal exponential
distribution. Thus it is recommended to replace the nonrobust time-average estimator
(8.155) with the scaled median estimator

$$\bar S^*(f_k) = \frac{1}{0.693}\, \mathrm{Med}\{\hat S_m(f_k),\ m = 1, \ldots, M\}. \tag{8.158}$$

This estimator can be expected to work well in situations where less than half of the
time segments of data contain influential outliers.
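
A small simulation makes the contrast between (8.155) and (8.158) visible. The sketch
below is illustrative only: it uses untapered segments of Gaussian white noise (so that
$S(f) = 1$), a single gross additive outlier, a plain $O(N^2)$ transform, and $\log 2
\approx 0.693$ as the scaling constant; the function names are assumptions.

```python
import cmath
import math
import random
import statistics

def segment_periodograms(y, N):
    """Periodogram |X_m(f_k)|^2 / N on each contiguous length-N segment."""
    out = []
    for m in range(len(y) // N):
        seg = y[m * N:(m + 1) * N]
        row = []
        for k in range(N // 2 + 1):
            X = sum(v * cmath.exp(-2j * math.pi * k * (t + 1) / N)
                    for t, v in enumerate(seg))
            row.append(abs(X) ** 2 / N)
        out.append(row)
    return out

def time_average(segs):
    """Sample mean over segments at each frequency, as in (8.155)."""
    return [sum(row[k] for row in segs) / len(segs) for k in range(len(segs[0]))]

def scaled_median(segs):
    """Median over segments divided by log 2 = 0.693..., as in (8.158)."""
    return [statistics.median(row[k] for row in segs) / math.log(2.0)
            for k in range(len(segs[0]))]

rng = random.Random(3)
N, M = 64, 20
y = [rng.gauss(0.0, 1.0) for _ in range(N * M)]   # white noise: S(f) = 1
y[700] += 200.0                                    # one gross additive outlier
segs = segment_periodograms(y, N)
mean_est = time_average(segs)
median_est = scaled_median(segs)
```

The outlier falls in one of the twenty segments, which is enough to lift the time-average
estimator far above the true level at every frequency, while the scaled median stays
close to it.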

The idea of replacing the sample average in (8.155) with a robust estimator of the 
scale of an exponential distribution was considered by Thomson (1977) and discussed 
by Martin and Thomson (1982), with a focus on using an asymmetric truncated mean 
as the robust estimator. See also Chave et al. (1987) for an application. 
8.15 Appendix A: Heuristic derivation of the
asymptotic distribution of M-estimators
for ARMA models


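To simplify, we replace $\hat\sigma$ in (8.65) by its asymptotic value $\sigma$. Generally, $\hat\sigma$ is calibrated
so that when $u_t$ is normal, $\sigma^2 = \sigma_u^2 = Eu_t^2$. Differentiating (8.65) we obtain

$$\sum_{t=p+1}^{T} \psi\!\left( \frac{\hat u_t(\hat\lambda)}{\sigma} \right) \frac{\partial \hat u_t(\hat\lambda)}{\partial \lambda} = 0. \tag{8.159}$$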


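We leave it as an exercise (Problem 8.9) to show that

$$\frac{\partial \hat u_t(\lambda)}{\partial \mu} = -\frac{1 - \phi_1 - \cdots - \phi_p}{1 - \theta_1 - \cdots - \theta_q}, \tag{8.160}$$

$$\frac{\partial \hat u_t(\lambda)}{\partial \phi_i} = -\phi^{-1}(B)\, \hat u_{t-i}(\lambda) \tag{8.161}$$

and

$$\frac{\partial \hat u_t(\lambda)}{\partial \theta_j} = \theta^{-1}(B)\, \hat u_{t-j}(\lambda). \tag{8.162}$$

Let

$$\mathbf{z}_t = \left. \frac{\partial \hat u_t(\lambda)}{\partial \lambda} \right|_{\lambda = \lambda_0}, \qquad
\mathbf{W}_t = \left. \frac{\partial^2 \hat u_t(\lambda)}{\partial \lambda\, \partial \lambda'} \right|_{\lambda = \lambda_0},$$

where $\lambda_0$ is the true value of the parameter. Observe that

$$\mathbf{z}_t = (\mathbf{c}_t', \mathbf{d}_t', \xi)'$$

with $\xi$ defined in (8.63) and

$$\mathbf{c}_t = -\big(\phi^{-1}(B)u_{t-1}, \ldots, \phi^{-1}(B)u_{t-p}\big)',$$

$$\mathbf{d}_t = \big(\theta^{-1}(B)u_{t-1}, \ldots, \theta^{-1}(B)u_{t-q}\big)'.$$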
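Since $\hat u_t(\lambda_0) = u_t$, a first-order Taylor expansion yields

$$\sum_{t=p+1}^{T} \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t
+ \left( \frac{1}{\sigma} \sum_{t=p+1}^{T} \psi'\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t\mathbf{z}_t'
+ \sum_{t=p+1}^{T} \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{W}_t \right)(\hat\lambda - \lambda_0) \approx 0$$

and then

$$T^{1/2}(\hat\lambda - \lambda_0) \approx -\mathbf{B}^{-1} \left( \frac{1}{T^{1/2}} \sum_{t=p+1}^{T} \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t \right) \tag{8.163}$$

with

$$\mathbf{B} = \frac{1}{\sigma T} \sum_{t=p+1}^{T} \psi'\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t\mathbf{z}_t'
+ \frac{1}{T} \sum_{t=p+1}^{T} \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{W}_t.$$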
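We shall show that

$$\operatorname*{p\,lim}_{T \to \infty} \mathbf{B} = \frac{1}{\sigma}\, E\psi'\!\left(\frac{u_t}{\sigma}\right) E\mathbf{z}_t\mathbf{z}_t', \tag{8.164}$$

and that

$$\frac{1}{T^{1/2}} \sum_{t=p+1}^{T} \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t \to_d
N_{p+q+1}\!\left( \mathbf{0},\ E\psi\!\left(\frac{u_t}{\sigma}\right)^{2} E\mathbf{z}_t\mathbf{z}_t' \right). \tag{8.165}$$

From (8.163), (8.164) and (8.165), we get

$$T^{1/2}(\hat\lambda - \lambda_0) \to_d N_{p+q+1}(\mathbf{0}, \mathbf{V}_M), \tag{8.166}$$

where

$$\mathbf{V}_M = \sigma^2\, \frac{E\psi(u_t/\sigma)^2}{\big(E\psi'(u_t/\sigma)\big)^2}\, \big(E\mathbf{z}_t\mathbf{z}_t'\big)^{-1}. \tag{8.167}$$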

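It is not difficult to show that the terms on the left-hand side of (8.164) are
uncorrelated and have the same mean and variance. Hence it follows from the weak law of
large numbers that

$$\operatorname*{p\,lim}_{T \to \infty} \frac{1}{T} \sum_{t=p+1}^{T}
\left( \frac{1}{\sigma}\psi'\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t\mathbf{z}_t' + \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{W}_t \right)
= \frac{1}{\sigma} E\!\left( \psi'\!\left(\frac{u_t}{\sigma}\right)\mathbf{z}_t\mathbf{z}_t' \right)
+ E\!\left( \psi\!\left(\frac{u_t}{\sigma}\right)\mathbf{W}_t \right).$$

Then, (8.164) follows from the fact that $u_t$ is independent of $\mathbf{z}_t$ and $\mathbf{W}_t$, and
(8.66).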

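Recall that a sequence of random vectors $\mathbf{q}_n \in R^k$ converges in distribution to
$N_k(\mathbf{0}, \mathbf{A})$ if and only if each linear combination $\mathbf{a}'\mathbf{q}_n$ converges in distribution to
$N(0, \mathbf{a}'\mathbf{A}\mathbf{a})$ (Feller, 1971). Then, to prove (8.165) it is enough to show that for any
$\mathbf{a} \in R^{p+q+1}$

$$\frac{1}{\sqrt{T}} \sum_{t=p+1}^{T} H_t \to_d N(0, v_0), \tag{8.168}$$

where

$$H_t = \psi(u_t/\sigma)\, \mathbf{a}'\mathbf{z}_t$$

and

$$v_0 = E(H_t)^2 = \mathbf{a}' \left( E\psi\!\left(\frac{u_t}{\sigma}\right)^{2} E(\mathbf{z}_t\mathbf{z}_t') \right) \mathbf{a}.$$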
Since the variables are not independent, the standard central limit theorem cannot
be applied. However, it can be shown that the stationary process $H_t$ satisfies

$$E(H_t \mid H_{t-1}, \ldots, H_1) = 0 \quad \text{a.s.}$$

and is hence a so-called martingale difference sequence. Therefore, by the central
limit theorem for martingales (see Theorem 23.1 of Billingsley, 1968) (8.168) holds,
and hence (8.165) is proved.
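We shall now find the form of the covariance matrix $\mathbf{V}_M$. Let

$$\phi^{-1}(B)u_t = \sum_{i=0}^{\infty} \pi_i u_{t-i}$$

and

$$\theta^{-1}(B)u_t = \sum_{i=0}^{\infty} \zeta_i u_{t-i},$$

where $\pi_0 = \zeta_0 = 1$. We leave it as an exercise (Problem 8.10) to show that $E(\mathbf{z}_t\mathbf{z}_t')$ has
the following form:

$$E(\mathbf{z}_t\mathbf{z}_t') = \begin{bmatrix} \sigma_u^2 \mathbf{D} & \mathbf{0} \\ \mathbf{0} & \xi^2 \end{bmatrix} \tag{8.169}$$

where $\mathbf{D} = \mathbf{D}(\phi, \theta)$ is a symmetric $(p + q) \times (p + q)$ matrix with elements

$$D_{i,j} = \sum_{k=0}^{\infty} \pi_k \pi_{k+j-i} \quad \text{if } i \le j \le p,$$

$$D_{i,p+j} = -\sum_{k=0}^{\infty} \zeta_k \pi_{k+j-i} \quad \text{if } i \le p,\ j \le q,\ i \le j,$$

$$D_{i,p+j} = -\sum_{k=0}^{\infty} \pi_k \zeta_{k+i-j} \quad \text{if } i \le p,\ j \le q,\ j \le i,$$

$$D_{p+i,p+j} = \sum_{k=0}^{\infty} \zeta_k \zeta_{k+j-i} \quad \text{if } i \le j \le q.$$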

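Therefore, the asymptotic covariance matrix of $\hat\lambda$ is

$$\mathbf{V}_M = \frac{\sigma^2 E\psi(u_t/\sigma)^2}{\big(E\psi'(u_t/\sigma)\big)^2}
\begin{bmatrix} \sigma_u^{-2}\mathbf{D}^{-1} & \mathbf{0} \\ \mathbf{0} & \xi^{-2} \end{bmatrix}.$$

In the case of the LS estimator, since $\psi(u) = 2u$ and $\psi'(u) = 2$, we have

$$\frac{\sigma^2 E\psi(u_t/\sigma)^2}{\big(E\psi'(u_t/\sigma)\big)^2} = Eu_t^2 = \sigma_u^2$$

and hence the asymptotic covariance matrix is

$$\mathbf{V}_{LS} = \begin{bmatrix} \mathbf{D}^{-1} & \mathbf{0} \\ \mathbf{0} & \sigma_u^2\, \xi^{-2} \end{bmatrix}.$$

In consequence we have

$$\mathbf{V}_M = \frac{\sigma^2 E\psi(u_t/\sigma)^2}{\sigma_u^2 \big(E\psi'(u_t/\sigma)\big)^2}\, \mathbf{V}_{LS}.$$

In the AR($p$) case, the matrix $\sigma_u^2 \mathbf{D}$ coincides with the covariance matrix $\mathbf{C}$ of
$(y_t, y_{t-1}, \ldots, y_{t-p+1})$ used in (8.49).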
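
The scalar factor relating $\mathbf{V}_M$ to $\mathbf{V}_{LS}$ is easy to evaluate in a specific case. The
sketch below does so for a Huber $\psi$-function with standard normal innovations and
$\sigma = \sigma_u = 1$, using closed-form normal integrals; the choice of the Huber family,
the tuning constant 1.345 and the function name are assumptions made for the example.

```python
import math

def huber_variance_factor(k):
    """E psi(u)^2 / (E psi'(u))^2 for Huber psi with constant k and u ~ N(0, 1)."""
    cdf = 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))          # Phi(k)
    pdf = math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)   # normal density at k
    e_dpsi = 2.0 * cdf - 1.0                                  # P(|u| <= k)
    # E psi^2 = int_{-k}^{k} u^2 dPhi(u) + k^2 P(|u| > k)
    e_psi2 = (2.0 * cdf - 1.0) - 2.0 * k * pdf + 2.0 * k * k * (1.0 - cdf)
    return e_psi2 / e_dpsi ** 2

factor = huber_variance_factor(1.345)      # multiplies V_LS to give V_M
efficiency = 1.0 / factor                  # asymptotic efficiency relative to LS
```

With this constant the factor is about 1.05, which corresponds to the familiar 95%
normal efficiency; as $k$ grows the factor tends to 1 and the LS covariance is recovered.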


8.16 Appendix B: Robust filter covariance recursions 


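The vector $\mathbf{m}_t$ appearing in (8.82) is the first column of the covariance matrix of the
state prediction error $\hat{\mathbf{x}}_{t|t-1} - \mathbf{x}_t$:

$$\mathbf{M}_t = E(\hat{\mathbf{x}}_{t|t-1} - \mathbf{x}_t)(\hat{\mathbf{x}}_{t|t-1} - \mathbf{x}_t)' \tag{8.170}$$

and

$$s_t = \sqrt{M_{t,11}} = \sqrt{m_{t,1}} \tag{8.171}$$

is the standard deviation of the observation prediction error $y_t - \hat y_{t|t-1} = y_t - \hat x_{t|t-1}$.
The recursion for $\mathbf{M}_t$ is

$$\mathbf{M}_t = \mathbf{\Phi}\mathbf{P}_{t-1}\mathbf{\Phi}' + \sigma_u^2\, \mathbf{d}\mathbf{d}', \tag{8.172}$$

where $\mathbf{P}_t$ is the covariance matrix of the state-filtering error $\hat{\mathbf{x}}_{t|t} - \mathbf{x}_t$:

$$\mathbf{P}_t = E(\hat{\mathbf{x}}_{t|t} - \mathbf{x}_t)(\hat{\mathbf{x}}_{t|t} - \mathbf{x}_t)'. \tag{8.173}$$

The recursion equation for $\mathbf{P}_t$ is

$$\mathbf{P}_t = \mathbf{M}_t - \frac{1}{s_t^2}\, W\!\left( \frac{\hat u_t}{s_t} \right) \mathbf{m}_t\mathbf{m}_t',$$

where $W(u) = \psi(u)/u$.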
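
In the AR(1) case all the quantities above are scalars, with $m_t = M_t = s_t^2$, and the
recursions can be sketched in a few lines. The state update used below,
$\hat x_{t|t} = \hat x_{t|t-1} + s_t^{-1} m_t\, \psi(\hat u_t/s_t)$, is assumed to be the form of (8.82),
which is not reproduced in this section; the two-part redescending $\psi$ with
constants 2 and 3, the stationary initial variance and the function names are
likewise illustrative assumptions rather than the book's choices.

```python
def psi_redescending(u, a=2.0, b=3.0):
    """Identity on [-a, a], linear descent to zero on a < |u| <= b, zero beyond."""
    au = abs(u)
    if au <= a:
        return u
    if au <= b:
        return (a * (b - au) / (b - a)) * (1.0 if u > 0 else -1.0)
    return 0.0

def weight(u):
    """W(u) = psi(u) / u, with W(0) = 1."""
    return 1.0 if u == 0.0 else psi_redescending(u) / u

def robust_filter_ar1(y, phi, sigma2_u):
    """Scalar robust filter for a centered AR(1); returns the filtered values."""
    x_filt = 0.0                                  # initial filtered state
    P = sigma2_u / (1.0 - phi ** 2)               # assumed initial filtering variance
    out = []
    for yt in y:
        x_pred = phi * x_filt                     # one-step state prediction
        M = phi * phi * P + sigma2_u              # (8.172) in scalar form
        s = M ** 0.5                              # (8.171)
        u = yt - x_pred                           # observation prediction residual
        x_filt = x_pred + (M / s) * psi_redescending(u / s)
        P = M - weight(u / s) * M * M / (s * s)   # recursion for P_t
        out.append(x_filt)
    return out

filtered = robust_filter_ar1([0.5, 0.4, 50.0, 0.3], phi=0.5, sigma2_u=1.0)
```

Clean observations pass through unchanged, while the gross value 50 is rejected and
replaced by the one-step prediction $0.5 \times 0.4 = 0.2$; at that step $W = 0$, so $P_t = M_t$
and the filter's uncertainty is carried forward.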

Reasonable initial conditions for the robust filter are Xoo = (0,0,...,0)’, 
and Py = P,, where P, is a p Xp robust estimator of the covariance matrix for 
Ot Yi-29 +++ Yinp) 

When applying the robust Durbin-Levinson algorithm to estimate an AR(p)
model, the above recursions need to be computed for each of a sequence of AR
orders $m = 1, \ldots, p$. Accordingly, we shall take $\sigma_u^2 = \sigma_{u,m}^2$, where $\sigma_{u,m}^2$ is the variance
of the memory-$m$ prediction error of $x_t$; that is,

$$
\sigma_{u,m}^2 = E(x_t - \phi_{m,1} x_{t-1} - \cdots - \phi_{m,m} x_{t-m})^2.
$$

Then we need an estimator $\hat\sigma_{u,m}^2$ of $\sigma_{u,m}^2$ for each $m$. This can be accomplished by
using the following relationships:

$$
\sigma_{u,1}^2 = (1 - \phi_{1,1}^2)\,\sigma_x^2, \tag{8.174}
$$

where $\sigma_x^2$ is the variance of $x_t$, and

$$
\sigma_{u,m}^2 = (1 - \phi_{m,m}^2)\,\sigma_{u,m-1}^2. \tag{8.175}
$$

In computing (8.91) for $m = 1$, we use the estimator $\hat\sigma_{u,1}^2$ of $\sigma_{u,1}^2$ parameterized
as a function of $\phi = \phi_{1,1}$ using (8.174):

$$
\hat\sigma_{u,1}^2(\phi) = (1 - \phi^2)\,\hat\sigma_x^2, \tag{8.176}
$$

where $\hat\sigma_x^2$ is a robust estimator of $\sigma_x^2$ based on the observations $y_t$. For example, we
might use an M- or $\tau$-scale, or the simple estimator $\hat\sigma_x = \mathrm{MAD}(y_t)/0.6745$ (the normalized MAD). Then
when computing (8.91) for $m > 1$, we use the estimator $\hat\sigma_{u,m}^2$ of $\sigma_{u,m}^2$ parameterized
as a function of $\phi = \phi_{m,m}$ using (8.175):

$$
\hat\sigma_{u,m}^2(\phi) = (1 - \phi^2)\,\hat\sigma_{u,m-1}^2, \tag{8.177}
$$

where $\hat\sigma_{u,m-1}$ is the minimized robust scale $\hat\sigma$ in (8.91) for the order-$(m - 1)$ fit.
Since the function in (8.91) may have more than one local extremum, the mini- 
mization is performed by means of a grid search on (—1, 1). 
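
The variance recursion (8.174)-(8.175) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prediction_error_variances` is hypothetical, the partial autocorrelations $\phi_{m,m}$ are taken as given inputs (in the robust Durbin-Levinson algorithm they would come from the grid search), and `var_x` stands for a robust estimate of $\sigma_x^2$.

```python
def prediction_error_variances(var_x, partial_autocorrs):
    """Return [sigma2_{u,1}, ..., sigma2_{u,p}] following (8.174)-(8.175).

    var_x: estimate of the variance of x_t (assumed supplied by the caller).
    partial_autocorrs: the values phi_{1,1}, phi_{2,2}, ..., phi_{p,p}.
    """
    variances = []
    previous = var_x  # sigma_x^2 plays the role of the "order 0" variance
    for phi_mm in partial_autocorrs:
        if not -1.0 < phi_mm < 1.0:
            raise ValueError("each phi_{m,m} must lie in (-1, 1)")
        # sigma2_{u,m} = (1 - phi_{m,m}^2) * sigma2_{u,m-1}
        previous = (1.0 - phi_mm ** 2) * previous
        variances.append(previous)
    return variances
```

A zero partial autocorrelation leaves the prediction-error variance unchanged, which is why adding an unnecessary AR order does not reduce the scale.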


8.17 Appendix C: ARMA model state-space 
representation 


Here we describe the state-space representation, (8.101) and (8.103), for an
ARMA(p, q) model, and show how to extend it to ARIMA and SARIMA models.
We note that a state-space representation of ARMA models is not unique, and the
particular representation we chose was that by Ledolter (1979) and Harvey and
Phillips (1979). For other representations see Akaike (1974b), Jones (1980) and
Chapter 12 of Brockwell and Davis (1991).

Let

$$
\phi(B)(x_t - \mu) = \theta(B) u_t.
$$

Define $\alpha_t = (\alpha_{1,t}, \ldots, \alpha_{p,t})'$, where

$$
\alpha_{1,t} = x_t - \mu,
$$

$$
\alpha_{j,t} = \phi_j (x_{t-1} - \mu) + \cdots + \phi_p (x_{t-p+j-1} - \mu)
- \theta_{j-1} u_t - \cdots - \theta_q u_{t-q+j-1}, \quad j = 2, \ldots, q + 1,
$$

and

$$
\alpha_{j,t} = \phi_j (x_{t-1} - \mu) + \cdots + \phi_p (x_{t-p+j-1} - \mu), \quad j = q + 2, \ldots, p.
$$

It is left as an exercise to show that the state-space representation (8.101) holds where
$d$ and $\Phi$ are given by (8.102) and (8.103) respectively. In the definition of $d$, we take
$\theta_i = 0$ for $i > q$.

The case $q \ge p$ is reduced to the above procedure on observing that $y_t$ can be
represented as an ARMA(q + 1, q) model where $\phi_i = 0$ for $i > p$. Thus, in general
the dimension of $\alpha$ is $k = \max(p, q + 1)$.

The above state-space representation is easily extended to represent an
ARIMA(p, d, q) model (8.108) by writing it as

$$
\phi^*(B)(y_t - \mu) = \theta(B) u_t, \tag{8.178}
$$

where $\phi^*(B) = \phi(B)(1 - B)^d$ has order $p^* = p + d$. Now we just proceed as above,
with the $\phi_i$ replaced by the $\phi_i^*$ coefficients in the polynomial $\phi^*(B)$, resulting in the
state-transition matrix $\Phi^*$. For example, in the case of an ARIMA(1, 1, q) model we
have $\phi_1^* = 1 + \phi_1$ and $\phi_2^* = -\phi_1$. The order of $\Phi^*$ is $k^* = \max(p^*, q + 1)$.

The above approach also easily handles the case of a SARIMA model (8.109).
One just defines

$$
\phi^*(B) = \phi(B)\,\Phi(B^s)\,(1 - B)^d (1 - B^s)^D, \tag{8.179}
$$

$$
\theta^*(B) = \theta(B)\,\Theta(B^s) \tag{8.180}
$$

and specifies the state-transition matrix $\Phi^*$ and vector $d^*$ based on the coefficients of
polynomials $\phi^*(B)$ and $\theta^*(B)$ of order $p^*$ and $q^*$ respectively. The order of $\Phi^*$ is now
$k = \max(p^*, q^* + 1)$.
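
The expansion of $\phi^*(B)$ in (8.178) and (8.179) is plain polynomial multiplication, which the following Python sketch illustrates. The function names (`poly_mul`, `ar_poly`, `expanded_ar_coefficients`, `state_dimension`) are hypothetical helpers introduced here, and the sign convention $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ is assumed.

```python
def poly_mul(a, b):
    """Multiply two polynomials in B given as coefficient lists [c0, c1, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def ar_poly(phis, step=1):
    """Coefficients of 1 - phi_1 B^step - phi_2 B^(2*step) - ..."""
    coeffs = [1.0] + [0.0] * (step * len(phis))
    for k, phi in enumerate(phis, start=1):
        coeffs[k * step] = -phi
    return coeffs


def expanded_ar_coefficients(phis, d=0, seasonal_phis=(), D=0, s=1):
    """Return [phi*_1, ..., phi*_{p*}] for phi(B) Phi(B^s) (1-B)^d (1-B^s)^D."""
    poly = poly_mul(ar_poly(phis), ar_poly(seasonal_phis, step=s))
    for _ in range(d):
        poly = poly_mul(poly, [1.0, -1.0])            # factor (1 - B)
    for _ in range(D):
        poly = poly_mul(poly, ar_poly([1.0], step=s))  # factor (1 - B^s)
    # phi*(B) = 1 - phi*_1 B - ..., so the coefficients change sign
    return [-c for c in poly[1:]]


def state_dimension(p_star, q):
    """Order of the state-transition matrix: max(p*, q + 1)."""
    return max(p_star, q + 1)
```

For an ARIMA(1, 1, q) model with $\phi_1 = 0.5$, `expanded_ar_coefficients([0.5], d=1)` gives the coefficients $1 + \phi_1 = 1.5$ and $-\phi_1 = -0.5$, in agreement with the example in the text.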


8.18 Recommendations and software 


For ARIMA and REGARIMA models we recommend the filtered $\tau$-estimators
(Section 8.8.3) computed with arima.rob (robust arima).

When used for ARIMA models without covariables, the formula should be of the 
form x ~ 1, where x is the name of the time series. In this case the intercept corre- 
sponds to the mean of the time series, or to the mean of the differenced time series if 
the specified order of differencing is positive. 


8.19 Problems 


8.1. Show that $|\hat\rho(1)| \le 1$ for $\hat\rho(1)$ in (8.3). Also show that if the summation in the
denominator in (8.3) ranges only from 1 to $T - 1$, then $|\hat\rho(1)|$ can be larger
than one.

8.2. Show that for a “doublet” outlier at $t_0$ (i.e., $y_{t_0} = A = -y_{t_0+1}$) with $t_0 \in (1, T)$,
the limiting value as $A \to \infty$ of $\hat\rho(1)$ in (8.3) is $-0.5$.

8.3. Show that the limiting value as $A \to \infty$ of $\hat\rho(1)$ defined in (8.4), when there is
an isolated outlier of size $A$, is $-1/T + O(1/T^2)$.

8.4. Construct a probability model for additive outliers $v_t$ that has non-overlapping
patches of length $k > 0$, such that $v_t = A$ within each patch and $v_t = 0$ otherwise,
and with $P(v_t \ne 0) = \varepsilon$.

8.5. Verify the expression for the Yule-Walker equations given by (8.28).

8.6. Verify that for an AR(1) model with parameter $\phi$ we have $\rho(1) = \phi$.

8.7. Show that the LS estimator of the AR(p) parameters given by (8.26) is equivalent
to solving the Yule-Walker equation(s) (8.28) with the true covariances
replaced by the sample ones (8.30).

8.8. Prove the orthogonality condition (8.37).

8.9. Verify (8.160)-(8.162).

8.10. Prove (8.169).

8.11. Prove (8.40).

8.12. Verify (8.13) using (8.134).

8.13. Calculate the spectral density for the case that $x_{t_0}$ and $x_{t_0+1}$ are replaced by $A$
and $-A$ respectively.


9 


Numerical Algorithms 


Computing M-estimators involves function minimization and/or solving nonlinear 
equations. General methods based on derivatives — like the Newton-Raphson proce- 
dure for solving equations — are widely available, but they are inadequate for this 
specific type of problem, for the reasons given in Section 2.10.5.1. 

In this chapter we consider some details of the iterative algorithms used to com- 
pute M-estimators, as described in earlier chapters. 


9.1 Regression M-estimators 


We shall justify the algorithm in Section 4.5 for solving (4.39); this includes location
as a special case. Consider the problem

$$
h(\beta) = \min,
$$

where

$$
h(\beta) = \sum_{i=1}^n \rho\!\left(\frac{r_i(\beta)}{\sigma}\right),
$$

with $r_i(\beta) = y_i - x_i'\beta$ and $\sigma$ any positive constant.

It is assumed that the $x_i$ are not collinear, otherwise there would be multiple solutions.
It is assumed that $\rho(r)$ is a $\rho$-function, that the function $W(x)$ defined in (2.31)
is nonincreasing in $|x|$, and that $\psi$ is continuous. These conditions are easily verified
for the Huber and the bisquare functions.


Robust Statistics: Theory and Methods (with R), Second Edition. 

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai and Matias Salibián-Barrera.
© 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd.
Companion website: www.wiley.com/go/maronna/robust

It will be proved that $h$ does not increase at each iteration, and that if there is a
single stationary point $\beta_0$ of $h$, that is, a point satisfying

$$
\sum_{i=1}^n \psi\!\left(\frac{r_i(\beta_0)}{\sigma}\right) x_i = 0, \tag{9.1}
$$

then the algorithm converges to it.
For $r \ge 0$ let $g(r) = \rho(\sqrt{r})$. It follows from $\rho(r) = g(r^2)$ that

$$
W(r) = 2g'(r^2) \tag{9.2}
$$

and hence $W(r)$ is nonincreasing for $r \ge 0$ if and only if $g'$ is nonincreasing.
We claim that

$$
g(y) \le g(x) + g'(x)(y - x); \tag{9.3}
$$

that is, the graph of $g$ lies below the tangent line. To show this, assume first that $y > x$
and note that by the mean value theorem,

$$
g(y) - g(x) = (y - x)\,g'(\xi),
$$

where $\xi \in [x, y]$. Since $g'$ is nonincreasing, $g'(\xi) \le g'(x)$. The case $y < x$ is dealt with
likewise.
A function $g$ with a nonincreasing derivative satisfies for all $x, y$ and all $\alpha \in [0, 1]$

$$
g(\alpha x + (1 - \alpha) y) \ge \alpha g(x) + (1 - \alpha) g(y); \tag{9.4}
$$

that is, the graph of $g$ lies above the secant line. Such functions are called concave.
Conversely, a differentiable function is concave if and only if its derivative is
nonincreasing. For twice differentiable functions, concavity is equivalent to having a
nonpositive second derivative.


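Define the matrix

$$
U(\beta) = \sum_{i=1}^n W\!\left(\frac{r_i(\beta)}{\sigma}\right) x_i x_i',
$$

which is nonnegative definite for all $\beta$, and the function

$$
f(\beta) = \arg\min_{\gamma} \sum_{i=1}^n W\!\left(\frac{r_i(\beta)}{\sigma}\right) (y_i - x_i'\gamma)^2.
$$

The algorithm can then be written as

$$
\beta_{k+1} = f(\beta_k). \tag{9.5}
$$

A fixed point $\beta_0$ (that is, one satisfying $f(\beta_0) = \beta_0$) is also a stationary
point (9.1).

Given $\beta_k$, put for simplicity $w_i = W(r_i(\beta_k)/\sigma)$. Note that $\beta_{k+1}$ satisfies

$$
\sum_{i=1}^n w_i x_i y_i = \sum_{i=1}^n w_i x_i x_i' \beta_{k+1} = U(\beta_k)\,\beta_{k+1}. \tag{9.6}
$$

We shall show that

$$
h(\beta_{k+1}) \le h(\beta_k). \tag{9.7}
$$

We have, using (9.3) and (9.2),

$$
h(\beta_{k+1}) - h(\beta_k)
\le \frac{1}{\sigma^2} \sum_{i=1}^n g'\!\left(\frac{r_i(\beta_k)^2}{\sigma^2}\right)
\left( r_i(\beta_{k+1})^2 - r_i(\beta_k)^2 \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^n w_i \left( r_i(\beta_{k+1}) - r_i(\beta_k) \right)
\left( r_i(\beta_{k+1}) + r_i(\beta_k) \right).
$$

But since

$$
r_i(\beta_{k+1}) - r_i(\beta_k) = (\beta_k - \beta_{k+1})' x_i
\quad\text{and}\quad
r_i(\beta_{k+1}) + r_i(\beta_k) = 2y_i - x_i'(\beta_k + \beta_{k+1}),
$$

we have, using (9.6),

$$
h(\beta_{k+1}) - h(\beta_k)
\le \frac{1}{2\sigma^2} (\beta_k - \beta_{k+1})' \sum_{i=1}^n w_i x_i x_i' (2\beta_{k+1} - \beta_k - \beta_{k+1})
= -\frac{1}{2\sigma^2} (\beta_{k+1} - \beta_k)'\, U(\beta_k)\, (\beta_{k+1} - \beta_k) \le 0,
$$

since $U(\beta_k)$ is nonnegative definite. This proves (9.7).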

We shall now prove the convergence of $\beta_k$ to $\beta_0$ in (9.1). To simplify the proof,
we make the stronger assumption that $\rho$ is increasing and hence $W(r) > 0$ for all $r$.
Since the sequence $h(\beta_k)$ is nonincreasing and is bounded from below, it has a limit
$h_0$. Hence the sequence $\beta_k$ is bounded, otherwise there would be a subsequence $\beta_{k_j}$
converging to infinity, and since $\rho$ is increasing, so would $h(\beta_{k_j})$.

Since $\beta_k$ is bounded, it has a subsequence that has a limit $\beta_0$, which by continuity
satisfies (9.5) and is hence a stationary point. If it is unique, then $\beta_k \to \beta_0$; otherwise,
there would exist a subsequence bounded away from $\beta_0$, which in turn would have a
convergent subsequence, which would have a limit different from $\beta_0$, which would
also be a stationary point. This concludes the proof of (9.1).
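
The monotone decrease (9.7) can be observed numerically in the location special case, where the weighted LS step reduces to a weighted mean. The sketch below is an illustration only: the function names are hypothetical, and it assumes one common parameterization of the Huber function ($\rho(r) = r^2/2$ for $|r| \le k$, linear beyond), for which $W(r) = \psi(r)/r$ is nonincreasing in $|r|$.

```python
def huber_rho(r, k=1.345):
    """Huber rho in the parameterization rho(r) = r^2/2 for |r| <= k."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * a - 0.5 * k * k


def huber_weight(r, k=1.345):
    """W(r) = psi(r)/r for the Huber psi; nonincreasing in |r|."""
    a = abs(r)
    return 1.0 if a <= k else k / a


def irwls_location(y, sigma, start, n_iter=50, tol=1e-10):
    """Iterative reweighting for a location M-estimator with fixed scale sigma.

    Returns the final estimate and the list of objective values h(mu_k).
    """
    mu = start
    objective = [sum(huber_rho((yi - mu) / sigma) for yi in y)]
    for _ in range(n_iter):
        w = [huber_weight((yi - mu) / sigma) for yi in y]
        new_mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)  # weighted LS step
        objective.append(sum(huber_rho((yi - new_mu) / sigma) for yi in y))
        converged = abs(new_mu - mu) <= tol * sigma
        mu = new_mu
        if converged:
            break
    return mu, objective
```

Starting from the sample mean of data containing one gross outlier, the recorded objective values form a nonincreasing sequence, and the limit satisfies the estimating equation (9.1) up to rounding.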

Another algorithm is based on pseudo-observations. Put

$$
\tilde{y}_i(\beta) = x_i'\beta + \hat\sigma\, \psi\!\left(\frac{r_i(\beta)}{\hat\sigma}\right).
$$

Then (4.40) is clearly equivalent to

$$
\sum_{i=1}^n x_i \left( \tilde{y}_i(\beta) - x_i'\beta \right) = 0.
$$

Given $\beta_k$, the next step of this algorithm is finding $\beta_{k+1}$ such that

$$
\sum_{i=1}^n x_i \left( \tilde{y}_i(\beta_k) - x_i'\beta_{k+1} \right) = 0,
$$


which is an ordinary LS problem. The procedure can be shown to converge (Huber 
and Ronchetti, 2009, Sec. 7.8) but it is much slower than the reweighting algorithm. 


9.2 Regression S-estimators 


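Here we deal with the descent algorithm described in Section 5.7.1.1. As explained
there, the algorithm coincides with the one for M-estimators. The most important
result is that, if $W$ is nonincreasing, then at each step $\hat\sigma$ does not increase.

To see this, consider at step $k$ the vector $\beta_k$ and the respective residual scale $\sigma_k$,
which satisfies

$$
\frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{r_i(\beta_k)}{\sigma_k}\right) = \delta.
$$

The next vector $\beta_{k+1}$ is obtained from (9.5) (with $\sigma$ replaced by $\sigma_k$), and hence
satisfies (9.7). Therefore

$$
\frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{r_i(\beta_{k+1})}{\sigma_k}\right)
\le \frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{r_i(\beta_k)}{\sigma_k}\right) = \delta. \tag{9.8}
$$

Since $\sigma_{k+1}$ satisfies

$$
\frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{r_i(\beta_{k+1})}{\sigma_{k+1}}\right) = \delta, \tag{9.9}
$$

and $\rho$ is nondecreasing, it follows from (9.9) and (9.8) that

$$
\sigma_{k+1} \le \sigma_k. \tag{9.10}
$$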


9.3 The LTS-estimator


We shall justify the procedure in Section 5.7.1.2. Call $\hat\sigma_1$ and $\hat\sigma_2$ the scales
corresponding to $\beta_1$ and $\beta_2$ respectively. For $k = 1, 2$ let $r_{i,k} = y_i - x_i'\beta_k$ be the respective
residuals, and call $r_{(i),k}^2$ the ordered squared residuals. Let $I \subset \{1, \ldots, n\}$ be the set of
indices corresponding to the $h$ smallest $r_{i,1}^2$. Then

$$
\hat\sigma_2^2 = \sum_{i=1}^h r_{(i),2}^2 \le \sum_{i \in I} r_{i,2}^2 \le \sum_{i \in I} r_{i,1}^2
= \sum_{i=1}^h r_{(i),1}^2 = \hat\sigma_1^2.
$$
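
The chain of inequalities above can be checked on a toy example. The following Python sketch treats the location-only case, where the LS fit to the selected subset is simply its mean; the function names `lts_objective` and `concentration_step` are hypothetical and the data are made up for illustration.

```python
def lts_objective(y, mu, h):
    """Sum of the h smallest squared residuals y_i - mu."""
    return sum(sorted((yi - mu) ** 2 for yi in y)[:h])


def concentration_step(y, mu, h):
    """One concentration step: LS fit (here the mean) on the h points
    with the smallest squared residuals with respect to the current mu."""
    order = sorted(range(len(y)), key=lambda i: (y[i] - mu) ** 2)
    subset = order[:h]
    return sum(y[i] for i in subset) / h
```

Each call to `concentration_step` yields a value whose `lts_objective` is not larger than that of the previous one, so repeated steps give a nonincreasing sequence of scales.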


9.4 Scale M-estimators 


9.4.1 Convergence of the fixed-point algorithm 


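We shall show that the algorithm (2.80) given for solving (2.49) converges.
Define $W$ as in (2.54). It is assumed again that $\rho(r)$ is a $\rho$-function of $|r|$. For $r \ge 0$
define

$$
g(r) = \rho(\sqrt{r}). \tag{9.11}
$$

It will be assumed that $g$ is concave (see below (9.4)). To make things simpler, we
assume that $g$ is twice differentiable, and that $g'' < 0$.

The concavity of $g$ implies that $W$ is nonincreasing. In fact, it follows from
$W(r) = g(r^2)/r^2$ that

$$
W'(r) = \frac{2}{r^3} \left( r^2 g'(r^2) - g(r^2) \right) \le 0,
$$

since (9.3) implies for all $t$

$$
0 = g(0) \le g(t) + g'(t)(0 - t) = g(t) - t g'(t). \tag{9.12}
$$

Put for simplicity $\theta = \sigma^2$ and $y_i = x_i^2$. Then (2.49) can be rewritten as

$$
\frac{1}{n} \sum_{i=1}^n g\!\left(\frac{y_i}{\theta}\right) = \delta
$$

and (2.80) can be rewritten as

$$
\theta_{k+1} = h(\theta_k), \tag{9.13}
$$

with

$$
h(\theta) = \frac{1}{n\delta} \sum_{i=1}^n \theta\, g\!\left(\frac{y_i}{\theta}\right). \tag{9.14}
$$

It will be shown that $h$ is nondecreasing and concave. It suffices to prove these
properties for each term of (9.14). In fact, for all $y$,

$$
\frac{d}{d\theta} \left( \theta\, g\!\left(\frac{y}{\theta}\right) \right)
= g\!\left(\frac{y}{\theta}\right) - \frac{y}{\theta}\, g'\!\left(\frac{y}{\theta}\right) \ge 0 \tag{9.15}
$$

because of (9.12); and

$$
\frac{d^2}{d\theta^2} \left( \theta\, g\!\left(\frac{y}{\theta}\right) \right)
= \frac{y^2}{\theta^3}\, g''\!\left(\frac{y}{\theta}\right) \le 0 \tag{9.16}
$$

because $g'' < 0$.

We shall now deal with the resolution of the equation

$$
h(\theta) = \theta.
$$

Assume it has a unique solution $\theta_0$. We shall show that

$$
\theta_k \to \theta_0.
$$

Note first that $h'(\theta_0) < 1$. For

$$
h(\theta_0) = \int_0^{\theta_0} h'(t)\,dt,
$$

and if $h'(\theta_0) \ge 1$, then $h'(t) > 1$ for $t < \theta_0$, and hence $h(\theta_0) > \theta_0$. Assume first that
$\theta_1 > \theta_0$. Since $h$ is nondecreasing, $\theta_2 = h(\theta_1) \ge h(\theta_0) = \theta_0$. We shall prove that
$\theta_2 < \theta_1$. In fact,

$$
\theta_2 = h(\theta_1) \le h(\theta_0) + h'(\theta_0)(\theta_1 - \theta_0) < \theta_0 + (\theta_1 - \theta_0) = \theta_1.
$$

In the same way, it follows that $\theta_0 \le \theta_{k+1} < \theta_k$. Hence the sequence $\theta_k$ decreases, and
since it is bounded from below, it has a limit. The case $\theta_1 < \theta_0$ is treated likewise.

Actually, the procedure can be accelerated. Given three consecutive values $\theta_k$,
$\theta_{k+1}$ and $\theta_{k+2}$, the straight line determined by the points $(\theta_k, \theta_{k+1})$ and $(\theta_{k+1}, \theta_{k+2})$
intersects the identity diagonal at the point $(\theta^*, \theta^*)$ with

$$
\theta^* = \frac{\theta_{k+1}^2 - \theta_k \theta_{k+2}}{2\theta_{k+1} - \theta_{k+2} - \theta_k}.
$$

Then set $\theta_{k+3} = \theta^*$. The accelerated procedure also converges under the given
assumptions.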
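
The fixed-point iteration (9.13)-(9.14) can be sketched as follows. This is a minimal illustration, not the book's implementation: the function names are hypothetical, the bisquare $\rho$ is scaled to have maximum 1 with an assumed tuning constant $c = 1.547$ and $\delta = 0.5$, and the starting value (a median of the $x_i^2$) is an arbitrary positive choice.

```python
def rho_bisquare(r, c=1.547):
    """Bisquare rho scaled to a maximum of 1 (c is an assumed tuning constant)."""
    u = r / c
    if abs(u) >= 1.0:
        return 1.0
    return 1.0 - (1.0 - u * u) ** 3


def m_scale_fixed_point(x, delta=0.5, tol=1e-10, max_iter=1000):
    """Solve mean(rho(x_i / sigma)) = delta by iterating theta <- h(theta),
    where theta = sigma^2 and h is as in (9.14)."""
    n = len(x)
    theta = sorted(xi * xi for xi in x)[n // 2]  # positive starting value
    if theta <= 0.0:
        raise ValueError("starting value must be positive")
    for _ in range(max_iter):
        sigma = theta ** 0.5
        # h(theta) = theta * sum_i rho(x_i / sigma) / (n * delta)
        new_theta = theta * sum(rho_bisquare(xi / sigma) for xi in x) / (n * delta)
        if abs(new_theta - theta) <= tol * theta:
            return new_theta ** 0.5
        theta = new_theta
    raise RuntimeError("fixed-point iteration did not converge")
```

For the bisquare function, $g(t) = \rho(\sqrt{t})$ has a nonincreasing derivative, so the concavity assumption of this section holds and the iterates approach the solution monotonically.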


9.4.2 Algorithms for the non-concave case 


If the function $g$ in (9.11) is not concave, the algorithm is not guaranteed to converge
to the solution. In this case (2.49) has to be solved by using a general equation-solving
procedure. For given $x_1, \ldots, x_n$, let

$$
h(\sigma) = \frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{x_i}{\sigma}\right) - \delta. \tag{9.17}
$$

Then we have to solve $h(\sigma) = 0$. Procedures using derivatives, like the Newton-Raphson
method, cannot be used, since the boundedness of $\rho$ implies that $h'$ is not bounded
away from zero. Safe procedures without derivatives require locating the solution
in an interval $[\sigma_1, \sigma_2]$ such that $\mathrm{sgn}(h(\sigma_1)) \ne \mathrm{sgn}(h(\sigma_2))$. The simplest is the
bisection method, but faster ones exist and can be found, for example, in Brent
(1973).

To find $\sigma_1$ and $\sigma_2$, recall that $h$ is nonincreasing. Let $\sigma_0 = \mathrm{Med}(|x|)$ and set
$\sigma_1 = \sigma_0$. If $h(\sigma_1) > 0$, we are done; else set $\sigma_1 = \sigma_1/2$ and continue halving $\sigma_1$ until
$h(\sigma_1) > 0$. The same method (doubling instead of halving) yields $\sigma_2$.
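
The bracketing and bisection procedure just described can be sketched as follows. The names are hypothetical, and the $\rho$-function $\rho(r) = r^4/(1 + r^4)$ is an assumption chosen only because the corresponding $g(t) = t^2/(1 + t^2)$ is not concave; any bounded $\rho$-function could be passed instead.

```python
def rho_quartic(r):
    """A bounded rho-function whose g(t) = rho(sqrt(t)) is not concave."""
    r4 = r ** 4
    return r4 / (1.0 + r4)


def solve_scale_bisection(x, rho, delta, tol=1e-10):
    """Solve h(sigma) = mean(rho(x_i / sigma)) - delta = 0 by bisection."""
    n = len(x)

    def h(sigma):
        return sum(rho(xi / sigma) for xi in x) / n - delta

    sigma0 = sorted(abs(xi) for xi in x)[n // 2]  # a median of |x|
    if sigma0 <= 0.0:
        raise ValueError("Med(|x|) must be positive")
    lo = sigma0
    while h(lo) <= 0.0:      # halve until h(sigma_1) > 0
        lo /= 2.0
    hi = sigma0
    while h(hi) >= 0.0:      # double until h(sigma_2) < 0
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:     # h is nonincreasing: the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The halving loop terminates only if the limit of $h$ at zero is positive, which requires that not too many observations are exactly zero; a production version would guard against that case.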


9.5 Multivariate M-estimators 


Location and covariance will be treated separately for the sake of simplicity. A very 
detailed treatment of the convergence of the iterative reweighting algorithm for simul- 
taneous estimation was given by Arslan (2004). 
Location involves solving

$$
h(\mu) = \min,
$$

with

$$
h(\mu) = \sum_{i=1}^n \rho(d_i(\mu)),
$$

where

$$
d_i(\mu) = (x_i - \mu)' \Sigma^{-1} (x_i - \mu),
$$

which implies (6.11). The procedure is as follows. Given $\mu_k$, let

$$
\mu_{k+1} = \frac{1}{\sum_{i=1}^n w_i} \sum_{i=1}^n w_i x_i,
$$

with $w_i = W(d_i(\mu_k))$ and $W = \rho'$. Hence

$$
\sum_{i=1}^n w_i x_i = \mu_{k+1} \sum_{i=1}^n w_i. \tag{9.18}
$$

Assume that $W$ is nonincreasing, which is equivalent to $\rho$ being concave. It will
be shown that $h(\mu_{k+1}) \le h(\mu_k)$. The proof is similar to that of Section 9.1. It is easy
to show that the problem can be reduced to $\Sigma = I$, so that $d_i(\mu) = \|x_i - \mu\|^2$. Using
the concavity of $\rho$ and then (9.18),

$$
h(\mu_{k+1}) - h(\mu_k) \le \sum_{i=1}^n w_i \left[ \|x_i - \mu_{k+1}\|^2 - \|x_i - \mu_k\|^2 \right]
= (\mu_k - \mu_{k+1})' \sum_{i=1}^n w_i (2x_i - \mu_k - \mu_{k+1})
= (\mu_k - \mu_{k+1})' (\mu_{k+1} - \mu_k) \sum_{i=1}^n w_i \le 0.
$$

The treatment of the covariance matrix is more difficult (Maronna, 1976).


9.6 Multivariate S-estimators


9.6.1 S-estimators with monotone weights 


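For the justification of the algorithm in Section 6.8.2, we shall show that if the weight
function is nonincreasing, and hence $\rho$ is concave, then

$$
\hat\sigma_{k+1} \le \hat\sigma_k. \tag{9.19}
$$

Given $\mu_k$ and $\Sigma_k$, define $\hat\sigma_{k+1}$, $\mu_{k+1}$ and $\Sigma_{k+1}$ as in (6.59)-(6.60). It will be shown
that

$$
\sum_{i=1}^n \rho\!\left(\frac{d(x_i, \mu_{k+1}, \Sigma_{k+1})}{\hat\sigma_k}\right)
\le \sum_{i=1}^n \rho\!\left(\frac{d(x_i, \mu_k, \Sigma_k)}{\hat\sigma_k}\right). \tag{9.20}
$$

In fact, the concavity of $\rho$ yields (putting $w_i$ for the $w_{ki}$ of (6.59)):

$$
\sum_{i=1}^n \rho\!\left(\frac{d(x_i, \mu_{k+1}, \Sigma_{k+1})}{\hat\sigma_k}\right)
- \sum_{i=1}^n \rho\!\left(\frac{d(x_i, \mu_k, \Sigma_k)}{\hat\sigma_k}\right) \tag{9.21}
$$

$$
\le \frac{1}{\hat\sigma_k} \sum_{i=1}^n w_i \left[ d(x_i, \mu_{k+1}, \Sigma_{k+1}) - d(x_i, \mu_k, \Sigma_k) \right]. \tag{9.22}
$$

Note that $\mu_{k+1}$ is the weighted mean of the $x_i$ with weights $w_i$, and hence it
minimizes $\sum_{i=1}^n w_i (x_i - \mu)' A (x_i - \mu)$ for any positive definite matrix $A$. Therefore

$$
\sum_{i=1}^n w_i\, d(x_i, \mu_{k+1}, \Sigma_{k+1}) \le \sum_{i=1}^n w_i\, d(x_i, \mu_k, \Sigma_{k+1})
$$

and hence the sum on the right-hand side of (9.21) is not larger than

$$
\sum_{i=1}^n w_i\, d(x_i, \mu_k, \Sigma_{k+1}) - \sum_{i=1}^n w_i\, d(x_i, \mu_k, \Sigma_k) \tag{9.23}
$$

$$
= \sum_{i=1}^n y_i' \Sigma_{k+1}^{-1} y_i - \sum_{i=1}^n y_i' \Sigma_k^{-1} y_i, \tag{9.24}
$$

with $y_i = \sqrt{w_i}\,(x_i - \mu_k)$. Since

$$
\Sigma_{k+1} = \frac{C}{|C|^{1/p}} \quad\text{with}\quad C = \frac{1}{n} \sum_{i=1}^n y_i y_i',
$$

we have that $\Sigma_{k+1}$ is the sample covariance matrix of the $y_i$ normalized to unit
determinant, and by (6.35) it minimizes the sum of squared Mahalanobis distances among
matrices with unit determinant. Since $|\Sigma_k| = |\Sigma_{k+1}| = 1$, it follows that (9.23) is $\le 0$,
which proves (9.20).

Since

$$
\frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{d(x_i, \mu_{k+1}, \Sigma_{k+1})}{\hat\sigma_{k+1}}\right) = \delta,
$$

the proof of (9.19) follows like that of (9.10).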


9.6.2 The MCD 


The justification of the “concentration step” in Section 6.8.6 proceeds as in
Section 9.3. Put for $k = 1, 2$: $d_{ik} = d(x_i, \mu_k, \Sigma_k)$ and call $d_{(i)k}$ the respective ordered
values and $\hat\sigma_1, \hat\sigma_2$ the respective scales. Let $I \subset \{1, \ldots, n\}$ be the set of indices
corresponding to the smallest $h$ values of $d_{i1}$. Then $\mu_2$ and $\Sigma_2$ are the mean and the
normalized sample covariance matrix of the set $\{x_i : i \in I\}$. Hence (6.35) applied to
that set implies that

$$
\sum_{i \in I} d_{i2} \le \sum_{i \in I} d_{i1} = \sum_{i=1}^h d_{(i)1}
$$

and hence

$$
\hat\sigma_2^2 = \sum_{i=1}^h d_{(i)2} \le \sum_{i \in I} d_{i2} \le \hat\sigma_1^2.
$$


9.6.3 S-estimators with non-monotone weights 


Note first that if $\rho$ is not concave, the algorithm (2.80) is not guaranteed to yield the
scale $\sigma$, and hence the approach in Section 9.4.2 must be used to compute $\sigma$.

Now we describe the modification of the iterative algorithm for the S-estimator.
Call $(\hat\mu_N, \hat\Sigma_N)$ the estimators at iteration $N$, and $\hat\sigma(\hat\mu_N, \hat\Sigma_N)$ the respective scale. Call
$(\tilde\mu_{N+1}, \tilde\Sigma_{N+1})$ the values given by a step of the reweighting algorithm.

If $\hat\sigma(\tilde\mu_{N+1}, \tilde\Sigma_{N+1}) < \hat\sigma(\hat\mu_N, \hat\Sigma_N)$, then we proceed as usual, setting

$$
(\hat\mu_{N+1}, \hat\Sigma_{N+1}) = (\tilde\mu_{N+1}, \tilde\Sigma_{N+1}).
$$

If instead

$$
\hat\sigma(\tilde\mu_{N+1}, \tilde\Sigma_{N+1}) \ge \hat\sigma(\hat\mu_N, \hat\Sigma_N), \tag{9.25}
$$

then for a given $\xi \in R$ put

$$
(\hat\mu_{N+1}, \hat\Sigma_{N+1}) = (1 - \xi)(\hat\mu_N, \hat\Sigma_N) + \xi\,(\tilde\mu_{N+1}, \tilde\Sigma_{N+1}). \tag{9.26}
$$

Then it can be shown that there exists $\xi \in (0, 1)$ such that

$$
\hat\sigma(\hat\mu_{N+1}, \hat\Sigma_{N+1}) < \hat\sigma(\hat\mu_N, \hat\Sigma_N). \tag{9.27}
$$

The details are given below in Section 9.6.4.

If the situation in (9.25) occurs, then the algorithm proceeds as follows. Let
$\xi_0 \in (0, 1)$. Set $\xi = \xi_0$ and compute (9.26). If (9.27) occurs, we are done. Otherwise, set
$\xi = \xi \xi_0$ and repeat the former steps, and so on. At some point we must have (9.27).
In our programs we use $\xi_0 = 0.7$.

A more refined method would be a line search; that is, to compute (9.26) for
different values of $\xi$ and choose the one yielding minimum $\hat\sigma$. Our experiments do
not show that this extra effort yields better results.

It must be noted that when the computation is near a local minimum, it may happen
that because of rounding errors, no value of $\xi$ yields a decrease in $\hat\sigma$. Hence it is
advisable to stop the search when $\xi$ is less than a small prescribed constant and retain
$(\hat\mu_N, \hat\Sigma_N)$ as the final result.
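
The damping scheme (9.25)-(9.27) is a generic backtracking step and can be sketched independently of the S-estimator itself. In the Python illustration below, `damped_step` is a hypothetical name, the parameters $(\mu, \Sigma)$ are assumed to be flattened into a list of floats, and `scale` is any callable returning the scale for such a list; the lower bound `xi_min` plays the role of the small prescribed constant.

```python
def damped_step(current, proposal, scale, xi0=0.7, xi_min=1e-8):
    """Return the next iterate following (9.25)-(9.27).

    current, proposal: sequences of floats (flattened parameters).
    scale: callable mapping a parameter list to the scale to be minimized.
    """
    s_current = scale(current)
    if scale(proposal) < s_current:
        return list(proposal)            # usual reweighting step is accepted
    xi = xi0
    while xi >= xi_min:
        # convex combination (9.26) of the current and proposed values
        candidate = [(1.0 - xi) * c + xi * p for c, p in zip(current, proposal)]
        if scale(candidate) < s_current:
            return candidate             # (9.27) holds
        xi *= xi0
    return list(current)                 # no decrease found: keep the current value
```

Because (9.26) is a convex combination with $\xi \in (0, 1)$, a candidate built from two positive definite matrices is again positive definite, so no extra check is needed for the covariance part.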


9.6.4 *Proof of (9.27) 


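Let $h(z): R^m \to R$ be a differentiable function, and call $g$ its gradient at the point
$z$. Then for any $b \in R^m$, $h(z + \xi b) = h(z) + \xi g'b + o(\xi)$. Hence if $g'b < 0$, we have
$h(z + \xi b) < h(z)$ for sufficiently small $\xi$.

We must show that we are indeed in this situation. To simplify the exposition,
we deal only with $\mu$; we assume $\Sigma$ fixed, and without loss of generality we may take
$\Sigma = I$. Then $d(x, \mu, \Sigma) = \|x - \mu\|^2$. Call $\sigma(\mu)$ the solution of

$$
\frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{\|x_i - \mu\|^2}{\sigma}\right) = \delta. \tag{9.28}
$$

Call $g$ the gradient of $\sigma(\mu)$ at a given $\mu_1$. Then differentiating (9.28) with respect to
$\mu$ yields

$$
\sum_{i=1}^n w_i \left[ 2\sigma (x_i - \mu_1) + \|x_i - \mu_1\|^2 g \right] = 0
$$

with

$$
w_i = W\!\left(\frac{\|x_i - \mu_1\|^2}{\sigma}\right),
$$

and hence

$$
g = -\frac{2\sigma}{\sum_{i=1}^n w_i \|x_i - \mu_1\|^2} \sum_{i=1}^n w_i (x_i - \mu_1). \tag{9.29}
$$

Call $\mu_2$ the result of an iteration of the reweighting algorithm; that is,

$$
\mu_2 = \frac{1}{\sum_{i=1}^n w_i} \sum_{i=1}^n w_i x_i.
$$

Then

$$
\mu_2 - \mu_1 = \frac{1}{\sum_{i=1}^n w_i} \sum_{i=1}^n w_i (x_i - \mu_1), \tag{9.30}
$$

and it follows from (9.30) and (9.29) that $(\mu_2 - \mu_1)'g < 0$.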


10 


Asymptotic Theory of 
M-estimators 


In order to compare the performances of different estimators, and also to obtain con- 
fidence intervals for the parameters, we need their distributions. Explicit expressions 
exist in some simple cases, such as sample quantiles, which include the median, but 
even these are in general intractable. It will be necessary to resort to approximating 
their distributions for large n, the so-called asymptotic distribution. 

We shall begin with the case of a single real parameter, and we shall consider
general M-estimators of a parameter $\theta$ defined by equations of the form

$$
\sum_{i=1}^n \Psi(x_i, \theta) = 0. \tag{10.1}
$$

For location, $\Psi$ has the form $\Psi(x, \theta) = \psi(x - \theta)$ with $\theta \in R$; for scale,
$\Psi(x, \theta) = \rho(|x|/\theta) - \delta$ with $\theta > 0$. If $\psi$ (or $\rho$) is nondecreasing then $\Psi$ is
nonincreasing in $\theta$.

This family contains maximum likelihood estimators (MLEs). Let $f_\theta(x)$ be a
family of densities. The likelihood function for an i.i.d. sample $x_1, \ldots, x_n$ with
density $f_\theta$ is

$$
L = \prod_{i=1}^n f_\theta(x_i).
$$

If $f_\theta$ is everywhere positive, and is differentiable with respect to $\theta$ with derivative
$\dot{f}_\theta = \partial f_\theta / \partial\theta$, taking logs it is seen that the MLE is the solution of

$$
\sum_{i=1}^n \Psi_0(x_i, \theta) = 0,
$$

with

$$
\Psi_0(x, \theta) = -\frac{\partial \log f_\theta(x)}{\partial\theta} = -\frac{\dot{f}_\theta(x)}{f_\theta(x)}. \tag{10.2}
$$


10.1 Existence and uniqueness of solutions 


We shall first consider the existence and uniqueness of solutions of (10.1). It
is assumed that $\theta$ ranges in a finite or infinite interval $(\theta_1, \theta_2)$. For location,
$\theta_2 = -\theta_1 = \infty$; for scale, $\theta_2 = \infty$ and $\theta_1 = 0$. Henceforth the symbol ∎ means “this
is the end of the proof”.


Theorem 10.1 Assume that for each $x$, $\Psi(x, \theta)$ is nonincreasing in $\theta$ and

$$
\lim_{\theta \to \theta_1} \Psi(x, \theta) > 0 > \lim_{\theta \to \theta_2} \Psi(x, \theta) \tag{10.3}
$$

(both limits may be infinite). Let

$$
g(\theta) = \sum_{i=1}^n \Psi(x_i, \theta).
$$

Then:
a) There is at least one point $\hat\theta = \hat\theta(x_1, \ldots, x_n)$ at which $g$ changes sign; that is,
   $g(\theta) \ge 0$ for $\theta < \hat\theta$ and $g(\theta) \le 0$ for $\theta > \hat\theta$.
b) The set of such points is an interval.
c) If $\Psi$ is continuous in $\theta$, then $g(\hat\theta) = 0$.
d) If $\Psi$ is decreasing, then $\hat\theta$ is unique.


Proof. It follows from (10.3) that

$$
\lim_{\theta \to \theta_1} g(\theta) > 0 > \lim_{\theta \to \theta_2} g(\theta) \tag{10.4}
$$

and the existence of $\hat\theta$ follows from the monotonicity of $g$. If two values satisfy
$g(\theta) = 0$, then the monotonicity of $g$ implies that any value between them also does,
which yields point (b). Statement (c) follows from the intermediate value theorem;
and point (d) is immediate. ∎


Example 10.1 If $\Psi(x, \theta) = \mathrm{sgn}(x - \theta)$, which is neither continuous nor increasing,
then

$$
g(\theta) = \#(x_i > \theta) - \#(x_i < \theta).
$$

The reader can verify that for $n$ odd, $n = 2m - 1$, $g$ vanishes only at $\hat\theta = x_{(m)}$, and for
$n$ even, $n = 2m$, it vanishes on the interval $(x_{(m)}, x_{(m+1)})$.
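
The claim of Example 10.1 is easy to check numerically. The short Python sketch below (the function name `sign_score` is hypothetical) evaluates $g(\theta) = \#(x_i > \theta) - \#(x_i < \theta)$ for a given sample.

```python
def sign_score(x, theta):
    """g(theta) = #(x_i > theta) - #(x_i < theta) for Psi(x, theta) = sgn(x - theta)."""
    return sum(xi > theta for xi in x) - sum(xi < theta for xi in x)
```

For the odd-sized sample `[3.0, 1.0, 2.0, 5.0, 4.0]` the score vanishes only at the sample median 3.0, whereas for `[1.0, 2.0, 3.0, 4.0]` it vanishes at every point strictly between 2.0 and 3.0.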


Example 10.2 The equation for scale M-estimation (2.49) does not satisfy (10.3)
since $\rho(0) = 0$ implies $\Psi(0, \theta) = -\delta < 0$ for all $\theta$. But the same reasoning shows that
(10.4) holds if

$$
\frac{\#(x_i = 0)}{n} < 1 - \frac{\delta}{\rho(\infty)}.
$$

Uniqueness may hold without requiring the strict monotonicity of $\Psi$. For instance,
Huber's $\psi$ is not increasing, but the respective location estimator is unique unless
there is a large gap in the middle of the data (Problem 10.7). A sufficient condition
for the uniqueness of scale estimators is that $\rho(x)$ be increasing for all $x$ such that
$\rho(x) < \rho(\infty)$ (Problem 10.6).


10.1.1 Redescending location estimators 


The above results do not cover the case of location estimators with a redescending $\psi$.
In this case, uniqueness requires stronger assumptions than the case of monotone $\psi$.
Uniqueness of the asymptotic value of the estimator requires that the distribution of
$x$, besides being symmetric, is unimodal; that is, it has a density $f(x)$, which for some
$\mu$ is increasing for $x < \mu$ and decreasing for $x > \mu$.


Theorem 10.2 Let $x$ have a density $f(x)$ which is a decreasing function of $|x|$, and
let $\rho$ be any $\rho$-function. Then $\lambda(\mu) = E\rho(x - \mu)$ has a unique minimum at $\mu = 0$.


Proof. Recall that $\rho$ is even and hence its derivative $\psi$ is odd. Hence the derivative of
$\lambda$ is

$$
\lambda'(\mu) = -\int_{-\infty}^{\infty} f(x)\,\psi(x - \mu)\,dx
= \int_0^{\infty} \psi(x) \left[ f(x - \mu) - f(x + \mu) \right] dx.
$$

We shall show that $\lambda'(\mu) > 0$ if $\mu > 0$. It follows from the definition of $\rho$-function that
$\psi(x) \ge 0$ for $x \ge 0$ and $\psi(x) > 0$ if $x \in (0, x_0)$ for some $x_0$. If $x$ and $\mu$ are positive, then
$|x - \mu| < |x + \mu|$ and hence $f(x - \mu) > f(x + \mu)$, which implies that the last integral
above is positive. ∎


If $\psi$ is redescending and $f$ is not unimodal, the minimum need not be unique. Let,
for instance, $f$ be a mixture: $f = 0.5 f_1 + 0.5 f_2$, where $f_1$ and $f_2$ are the densities of
$N(k, 1)$ and $N(-k, 1)$, respectively. Then if $k$ is large enough, $\lambda(\mu)$ has two minima,
located near $k$ and $-k$. The reason can be seen intuitively by noting that, if $k$ is large,
for $\mu > 0$, $\lambda(\mu)$ is approximately $0.5 \int \rho(x - \mu) f_1(x)\,dx$, which has a minimum at $k$.
Note that instead the asymptotic value of a monotone estimator is uniquely defined
for this distribution.


10.2 Consistency 


Let $x_1, \ldots, x_n$ now be i.i.d. with distribution $F$. We shall consider the behavior of
the solution $\hat\theta_n$ of (10.1) as a random variable. Recall that a sequence $y_n$ of random
variables tends in probability to $y$ if $P(|y_n - y| > \varepsilon) \to 0$ for all $\varepsilon > 0$; this will be
denoted by $y_n \to_p y$ or $\mathrm{plim}\, y_n = y$. The sequence $y_n$ tends almost surely (a.s.) or
with probability one to $y$ if $P(\lim_{n \to \infty} y_n = y) = 1$. The expectation with respect to a
distribution $F$ will be denoted by $E_F$.

We shall need a general result.


Theorem 10.3 (Monotone convergence theorem) Let $y_n$ be a nondecreasing
sequence of random variables such that $E|y_n| < \infty$, and $y_n \to y$ with probability
one. Then

$$
E y_n \to E y.
$$

The proof can be found in Feller (1971).
Assume that $E_F|\Psi(x, \theta)| < \infty$ for each $\theta$, and define

$$
\lambda_F(\theta) = E_F \Psi(x, \theta). \tag{10.5}
$$


Theorem 10.4 Assume that $E_F|\Psi(x, \theta)| < \infty$ for all $\theta$. Under the assumptions of
Theorem 10.1, there exists $\theta_F$ such that $\lambda_F$ changes sign at $\theta_F$.


Proof. Proceed along the same lines as the proof of Theorem 10.1. The interchange
of limits and expectations is justified by the monotone convergence theorem. ∎


Note that if $\lambda_F$ is continuous, then

$$
E_F \Psi(x, \theta_F) = 0. \tag{10.6}
$$

Theorem 10.5 If $\theta_F$ is unique, then $\hat\theta_n$ tends in probability to $\theta_F$.


Proof. To simplify the proof, we shall assume $\hat\theta_n$ unique. Then it will be shown that
for any $\varepsilon > 0$,

$$
\lim_{n \to \infty} P(\hat\theta_n < \theta_F - \varepsilon) = 0.
$$

Let

$$
\hat\lambda_n(\theta) = \frac{1}{n} \sum_{i=1}^n \Psi(x_i, \theta).
$$

Since $\hat\lambda_n$ is nonincreasing and $\hat\theta_n$ is unique, $\hat\theta_n < \theta_F - \varepsilon$ implies $\hat\lambda_n(\theta_F - \varepsilon) < 0$. Since
$\hat\lambda_n(\theta_F - \varepsilon)$ is the average of the i.i.d. variables $\Psi(x_i, \theta_F - \varepsilon)$, and has expectation
$\lambda_F(\theta_F - \varepsilon)$ by (10.5), the law of large numbers implies that

$$
\hat\lambda_n(\theta_F - \varepsilon) \to_p \lambda_F(\theta_F - \varepsilon) > 0.
$$

Hence

$$
\lim_{n \to \infty} P(\hat\theta_n < \theta_F - \varepsilon) \le \lim_{n \to \infty} P(\hat\lambda_n(\theta_F - \varepsilon) < 0) = 0.
$$

The same method proves that $P(\hat\theta_n > \theta_F + \varepsilon) \to 0$. ∎


Example 10.3 For location $\Psi(x, \theta) = \psi(x - \theta)$. If $\psi(x) = x$, then $\Psi$ is continuous
and decreasing, the solution is $\hat\theta_n = \bar{x}$ and $\lambda(\theta) = Ex - \theta$, so that $\theta_F = Ex$;
convergence occurs only if the latter exists.

If $\psi(x) = \mathrm{sgn}(x)$, we have

$$
\lambda(\theta) = P(x > \theta) - P(x < \theta),
$$

hence $\theta_F$ is a median of $F$, which is unique iff

$$
F(\theta_F + \varepsilon) > F(\theta_F - \varepsilon) \quad \forall \varepsilon > 0. \tag{10.7}
$$

In this case, for $n = 2m$ the interval $(x_{(m)}, x_{(m+1)})$ shrinks to a single point when
$m \to \infty$. If (10.7) does not hold, the distribution of $\hat\theta_n$ does not converge to a point-mass
(Problem 10.2).


Note that for model (2.1), if $\psi$ is odd and $D(e)$ is symmetric about 0, then $\lambda(0) = 0$
so that $\theta_F = 0$. For scale, Theorem 10.5 implies that estimators of the form (2.49) tend
to the solution of (2.50) if it is unique.


10.3 Asymptotic normality


In Section 2.10.2, the asymptotic normality of M-estimators was proved heuristically,
by replacing $\psi$ with its first-order Taylor expansion. This procedure will now be made
rigorous.

If the distribution of $z_n$ tends to the distribution $H$ of $z$, we shall say that $z_n$ tends
in distribution to $z$ (or to $H$), and shall denote this by $z_n \to_d z$ (or $z_n \to_d H$). We shall
need an auxiliary result.


Theorem 10.6 (Bounded convergence theorem) Let $y_n$ be a sequence of random
variables such that $|y_n| \le z$ where $Ez < \infty$ and $y_n \to y$ a.s. Then $E y_n \to E y$.


The proof can be found in Feller (1971).

Theorem 10.7 Assume that $A = E\Psi(x, \theta_F)^2 < \infty$ and that $B = \lambda'(\theta_F)$ exists and is
nonnull. Then the distribution of $\sqrt{n}\,(\hat\theta_n - \theta_F)$ tends to $N(0, v)$ with

$$
v = \frac{A}{B^2}.
$$

If $\dot\Psi(x, \theta) = \partial\Psi/\partial\theta$ exists and verifies for all $x, \theta$

$$
|\dot\Psi(x, \theta)| \le K(x) \quad\text{with}\quad EK(x) < \infty, \tag{10.8}
$$

then $B = E\dot\Psi(x, \theta_F)$.


Proof. To make things simpler, we shall make the extra (and unnecessary) assumptions
that $\ddot\Psi(x, \theta) = \partial^2\Psi/\partial\theta^2$ exists and is bounded, and that $\dot\Psi$ verifies (10.8). A
completely general proof may be found in Huber and Ronchetti (2009, Sec. 3.2).
Note first that the bounded convergence theorem implies $B = E\dot\Psi(x, \theta_F)$. In fact,

$$
B = \lim_{\delta \to 0} E\, \frac{\Psi(x, \theta_F + \delta) - \Psi(x, \theta_F)}{\delta}.
$$

The term in the expectation is bounded in absolute value by $K(x)$ by the mean value theorem,
and for each $x$ tends to $\dot\Psi(x, \theta_F)$.
A second-order Taylor expansion of $\Psi$ at $\theta_F$ yields

$$
\Psi(x_i, \hat\theta_n) = \Psi(x_i, \theta_F) + (\hat\theta_n - \theta_F)\,\dot\Psi(x_i, \theta_F)
+ \frac{1}{2} (\hat\theta_n - \theta_F)^2\, \ddot\Psi(x_i, \theta_i),
$$

where $\theta_i$ is some value (depending on $x_i$) between $\hat\theta_n$ and $\theta_F$. Summing over $i$ yields

$$
0 = A_n + (\hat\theta_n - \theta_F) B_n + (\hat\theta_n - \theta_F)^2 C_n,
$$

where

$$
A_n = \frac{1}{n} \sum_{i=1}^n \Psi(x_i, \theta_F), \quad
B_n = \frac{1}{n} \sum_{i=1}^n \dot\Psi(x_i, \theta_F), \quad
C_n = \frac{1}{2n} \sum_{i=1}^n \ddot\Psi(x_i, \theta_i),
$$

and hence

$$
\sqrt{n}\,(\hat\theta_n - \theta_F) = -\frac{\sqrt{n}\, A_n}{B_n + (\hat\theta_n - \theta_F) C_n}.
$$

Since the i.i.d. variables $\Psi(x_i, \theta_F)$ have mean 0 (by (10.6)) and variance $A$, the central
limit theorem implies that the numerator tends in distribution to $N(0, A)$. The law of
large numbers implies that $B_n \to_p B$; and since $C_n$ is bounded and $(\hat\theta_n - \theta_F) \to_p 0$ by
the former theorem, Slutsky's lemma (Section 2.10.3) yields the desired result. ∎


Example 10.4 (Location) For the mean, the existence of $A$ requires that of $Ex^2$.
In general, if $\psi$ is bounded, $A$ always exists. If $\psi'$ exists, then $\lambda'(t) = -E\psi'(x - t)$. For
the median, $\psi$ is discontinuous, but if $F$ has a density $f$, explicit calculation yields


$$
\lambda(\theta) = P(x > \theta) - P(x < \theta) = 1 - 2F(\theta),
$$

and hence $\lambda'(\theta_F) = -2f(\theta_F)$.

If $\lambda'(\theta_F)$ does not exist, $\hat\theta_n$ tends to $\theta_F$ faster than $n^{-1/2}$, and there is no
asymptotic normality. Consider, for instance, the median with $F$ discontinuous. Let
$\psi(x) = \mathrm{sgn}(x)$, and assume that $F$ is continuous except at zero, where it has its
median and a point mass with $P(x = 0) = 2\delta$; that is,

$$
\lim_{x \uparrow 0} F(x) = 0.5 - \delta, \quad \lim_{x \downarrow 0} F(x) = 0.5 + \delta.
$$

Then $\lambda(\theta) = 1 - 2F(\theta)$ has a jump at $\theta_F = 0$. We shall see that this entails
$P(\hat\theta_n = 0) \to 1$, and a fortiori $\sqrt{n}\,\hat\theta_n \to_p 0$.

Let $N_n = \#(x_i < 0)$, which is binomial $\mathrm{Bi}(n, p)$ with $p = 0.5 - \delta$. Then $\hat\theta_n < 0$
implies $N_n \ge n/2$, and therefore

$$
P(\hat\theta_n < 0) \le P(N_n/n \ge 0.5) \to 0
$$

since the law of large numbers implies $N_n/n \to_p p < 0.5$. The same method yields
$P(\hat\theta_n > 0) \to 0$.

The fact that the distribution of $\hat\theta_n$ tends to a normal $N(\theta_F, v)$ does not imply
that the mean and variance of $\hat\theta_n$ tend to $\theta_F$ and $v$ (Problem 10.3). In fact, if $F$ is
heavy tailed, the distribution of $\hat\theta_n$ will also be heavy tailed, with the consequence
that its moments may not exist, or, if they do, they will give misleading information
about $D(\hat\theta_n)$. In extreme cases, they may even not exist for any $n$. This shows that, as
an evaluation criterion, the asymptotic variance may be better than the variance. Let
$T_n = (\hat\theta_n - \theta_F)\sqrt{n/v}$, where $\hat\theta_n$ is the median and $v$ its asymptotic variance under
$F$, so that $T_n$ should be approximately $N(0, 1)$. Figure 10.1 shows for the Cauchy
distribution the normal Q-Q plot of $T_n$, that is, the comparison between the exact
and the approximate quantiles of its distribution, for $n = 5$ and 11. It is seen that
although the approximation improves in the middle when $n$ increases, the tails remain
heavy.
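
The formula $v = A/B^2$ of Theorem 10.7 can be evaluated in closed form for simple cases. The Python sketch below does so for a Huber location estimator and for the median when $F$ is the standard normal distribution; the function names are hypothetical, and the Huber $\psi$ is taken as $\psi_k(x) = \max(-k, \min(k, x))$, which is an assumption about the parameterization.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def huber_asymptotic_variance(k):
    """v = A / B^2 for psi_k(x) = max(-k, min(k, x)) under N(0, 1)."""
    p_inside = 2.0 * normal_cdf(k) - 1.0          # P(|x| <= k)
    # A = E psi_k(x)^2 = E[x^2; |x| <= k] + k^2 P(|x| > k)
    a = p_inside - 2.0 * k * normal_pdf(k) + k * k * (1.0 - p_inside)
    b = -p_inside                                  # B = -E psi_k'(x)
    return a / b ** 2


def median_asymptotic_variance():
    """v = A / B^2 for psi = sgn under N(0, 1): A = 1 and B = -2 f(0)."""
    b = -2.0 * normal_pdf(0.0)
    return 1.0 / b ** 2
```

With $k = 1.345$ the Huber estimator has asymptotic variance of about 1.05 at the normal, while the median has $\pi/2 \approx 1.57$; as $k$ grows, the Huber value approaches the variance 1 of the sample mean.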


10.4 Convergence of the SC to the IF 


In this section we prove (3.6) for general M-estimators (10.1). Call $\hat\theta_n$ the solution of
(10.1), and for a given $x_0$, call $\hat\theta_{n+1}(x_0)$ the solution of

$$
\sum_{i=0}^n \Psi(x_i, \theta) = 0. \tag{10.9}
$$

The sensitivity curve is

$$
SC_n(x_0) = (n + 1) \left( \hat\theta_{n+1}(x_0) - \hat\theta_n \right)
$$

and

$$
IF_{\hat\theta}(x_0) = -\frac{\Psi(x_0, \theta_F)}{B}
$$

with $B$ and $\theta_F$ defined in Theorem 10.7 and (10.6) respectively.


Figure 10.1 Q-Q plot of the sample median for Cauchy data (horizontal axis: normal
quantiles; vertical axis: exact quantiles). The dashed line is the identity diagonal.


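Theorem 10.8 Assume the same conditions as in Theorem 10.7. Then for each $x_0$

$$SC_n(x_0) \to_p \mathrm{IF}_{\hat\theta}(x_0).$$

Proof. Theorem 10.5 states that $\hat\theta_n \to_p \theta_F$. The same proof shows that also
$\hat\theta_{n+1}(x_0) \to_p \theta_F$, since the effect of the term $\Psi(x_0, \theta)$ becomes negligible for large $n$.
Hence

$$\Delta_n =: \hat\theta_{n+1}(x_0) - \hat\theta_n \to_p 0.$$

Using (10.1) and (10.9) and a Taylor expansion yields

$$0 = \Psi(x_0, \hat\theta_{n+1}(x_0)) + \sum_{i=1}^{n} \left[\Psi(x_i, \hat\theta_{n+1}(x_0)) - \Psi(x_i, \hat\theta_n)\right] \tag{10.10}$$

$$\phantom{0} = \Psi(x_0, \hat\theta_{n+1}(x_0)) + \Delta_n \sum_{i=1}^{n} \dot\Psi(x_i, \hat\theta_n) + \frac{\Delta_n^2}{2} \sum_{i=1}^{n} \ddot\Psi(x_i, \theta_i),$$

where $\dot\Psi$ and $\ddot\Psi$ are the first and second derivatives of $\Psi$ with respect to $\theta$, and $\theta_i$ is
some value between $\hat\theta_{n+1}(x_0)$ and $\hat\theta_n$. Put

$$B_n = \frac{1}{n}\sum_{i=1}^{n} \dot\Psi(x_i, \hat\theta_n), \qquad C_n = \frac{1}{n}\sum_{i=1}^{n} \ddot\Psi(x_i, \theta_i).$$

Then $C_n$ is bounded, and the consistency of $\hat\theta_n$ plus a Taylor expansion show that
$B_n \to_p B$. It follows from (10.10) that

$$SC_n(x_0) = -\frac{\Psi(x_0, \hat\theta_{n+1}(x_0))}{B_n + C_n\Delta_n/2}\; \frac{n+1}{n}.$$

And since $\Psi(x_0, \hat\theta_{n+1}(x_0)) \to_p \Psi(x_0, \theta_F)$, the proof follows. □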
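
Theorem 10.8 can be checked numerically for a location M-estimator. The sketch below is illustrative only. It assumes Huber's ψ with k = 1.345, the choice Ψ(x, θ) = ψ_k(x − θ) with known unit scale, and a deterministic pseudo-sample made of standard normal quantiles; none of these choices comes from the text above, and the function names are hypothetical. Under these assumptions B = −(2Φ(k) − 1), so the influence function is ψ_k(x_0)/(2Φ(k) − 1).

```python
from statistics import NormalDist

K = 1.345  # illustrative tuning constant for Huber's psi (an assumption)


def psi(t, k=K):
    """Huber's psi: identity on [-k, k], clipped outside."""
    return max(-k, min(k, t))


def m_location(xs, tol=1e-12):
    """Solve sum(psi(x_i - theta)) = 0 by bisection (the sum is nonincreasing in theta)."""
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(psi(x - mid) for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sensitivity_curve(xs, x0):
    """SC_n(x0) = (n + 1) * (theta_{n+1}(x0) - theta_n)."""
    n = len(xs)
    return (n + 1) * (m_location(xs + [x0]) - m_location(xs))


def influence_function(x0, k=K):
    """IF(x0) = psi_k(x0) / (2 * Phi(k) - 1) under the standard normal model."""
    return psi(x0, k) / (2.0 * NormalDist().cdf(k) - 1.0)


n = 2000
sample = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
for x0 in (-3.0, 0.5, 3.0):
    print(x0, round(sensitivity_curve(sample, x0), 4), round(influence_function(x0), 4))
```

For this n the two printed columns should nearly coincide; the remaining gap reflects the factor (n + 1)/n and the discreteness of the pseudo-sample.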


10.5 M-estimators of several parameters


We shall need the asymptotic distribution of M-estimators when there are several
parameters. This happens in particular with the joint estimation of location and
scale in Section 2.6.2, where we have two parameters, which satisfy a system of
two equations. This situation also appears in regression (Chapter 4) and multivariate
analysis (Chapter 6). Put $\boldsymbol\theta = (\mu, \sigma)$, and

$$\Psi_1(x, \boldsymbol\theta) = \psi\left(\frac{x-\mu}{\sigma}\right) \quad\text{and}\quad \Psi_2(x, \boldsymbol\theta) = \rho_{\mathrm{scale}}\left(\frac{x-\mu}{\sigma}\right) - \delta.$$

Then the simultaneous location-scale estimators satisfy

$$\sum_{i=1}^{n} \boldsymbol\Psi(x_i, \boldsymbol\theta) = \mathbf{0}, \tag{10.11}$$

with $\boldsymbol\Psi = (\Psi_1, \Psi_2)$. Here the observations $x_i$ are univariate, but in general they
may belong to any set $\mathcal{X} \subset R^q$, and we consider a vector $\boldsymbol\theta = (\theta_1, \ldots, \theta_p)'$ of
unknown parameters, which ranges in a subset $\Theta \subset R^p$ and satisfies (10.11),
where $\boldsymbol\Psi = (\Psi_1, \ldots, \Psi_p)$ is a function from $\mathcal{X} \times \Theta$ to $R^p$. Existence of solutions must
be dealt with in each situation. Uniqueness may be proved under conditions which
generalize the monotonicity of $\Psi$ in the case of a univariate parameter (as in part (d)
of Theorem 10.1).


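Theorem 10.9 Assume that for all $x$ and $\boldsymbol\theta$, $\boldsymbol\Psi(x, \boldsymbol\theta)$ is differentiable and the matrix
$\mathbf{D} = \mathbf{D}(x, \boldsymbol\theta)$ with elements $\partial\Psi_i/\partial\theta_j$ is negative definite (i.e., $\mathbf{a}'\mathbf{D}\mathbf{a} < 0$ for all $\mathbf{a} \ne \mathbf{0}$).
Put for given $x_1, \ldots, x_n$

$$\mathbf{g}(\boldsymbol\theta) = \sum_{i=1}^{n} \boldsymbol\Psi(x_i, \boldsymbol\theta).$$

If there exists a solution of $\mathbf{g}(\boldsymbol\theta) = \mathbf{0}$, then this solution is unique.


Proof. We shall prove that

$$\mathbf{g}(\boldsymbol\theta_1) \ne \mathbf{g}(\boldsymbol\theta_2) \quad\text{if } \boldsymbol\theta_1 \ne \boldsymbol\theta_2.$$

Let $\mathbf{a} = \boldsymbol\theta_2 - \boldsymbol\theta_1$, and define for $t \in R$ the function $h(t) = \mathbf{a}'\mathbf{g}(\boldsymbol\theta_1 + t\mathbf{a})$, so that
$h(0) = \mathbf{a}'\mathbf{g}(\boldsymbol\theta_1)$ and $h(1) = \mathbf{a}'\mathbf{g}(\boldsymbol\theta_2)$. Its derivative is

$$h'(t) = \sum_{i=1}^{n} \mathbf{a}'\mathbf{D}(x_i, \boldsymbol\theta_1 + t\mathbf{a})\,\mathbf{a} < 0 \quad \forall\, t,$$

and hence $h(0) > h(1)$, which implies $\mathbf{g}(\boldsymbol\theta_1) \ne \mathbf{g}(\boldsymbol\theta_2)$. □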


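To consider consistency, assume the $x_i$ are i.i.d. with distribution $F$, and put

$$\hat{\boldsymbol\lambda}_n(\boldsymbol\theta) = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol\Psi(x_i, \boldsymbol\theta) \tag{10.12}$$

and

$$\boldsymbol\lambda(\boldsymbol\theta) = \mathrm{E}_F \boldsymbol\Psi(x, \boldsymbol\theta). \tag{10.13}$$

Let $\hat{\boldsymbol\theta}_n$ be any solution of (10.11); it is natural to conjecture that, if there is a unique
solution $\boldsymbol\theta_F$ of $\boldsymbol\lambda(\boldsymbol\theta) = \mathbf{0}$, then as $n \to \infty$, $\hat{\boldsymbol\theta}_n$ tends in probability to $\boldsymbol\theta_F$. General criteria
are given in Huber and Ronchetti (2009, Sec. 6.2). However, their application to each
situation must be dealt with separately.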

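For asymptotic normality, we can generalize Theorem 10.7. Assume that
$\hat{\boldsymbol\theta}_n \to_p \boldsymbol\theta_F$ and that $\boldsymbol\lambda$ is differentiable at $\boldsymbol\theta_F$, and call $\mathbf{B}$ the matrix of derivatives with
elements

$$B_{jk} = \left.\frac{\partial\lambda_j}{\partial\theta_k}\right|_{\boldsymbol\theta = \boldsymbol\theta_F}. \tag{10.14}$$

Assume $\mathbf{B}$ is nonsingular. Then under general assumptions (Huber and Ronchetti,
2009, Sec. 6.3)

$$\sqrt{n}\,(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F) \to_d N_p\left(\mathbf{0}, \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1\prime}\right) \tag{10.15}$$

where

$$\mathbf{A} = \mathrm{E}\,\boldsymbol\Psi(x, \boldsymbol\theta_F)\,\boldsymbol\Psi(x, \boldsymbol\theta_F)', \tag{10.16}$$

and $N_p(\mathbf{t}, \mathbf{V})$ denotes the $p$-variate normal distribution with mean $\mathbf{t}$ and covariance
matrix $\mathbf{V}$.
If $\dot\Psi_{jk} = \partial\Psi_j/\partial\theta_k$ exists and verifies, for all $x$, $\boldsymbol\theta$,

$$|\dot\Psi_{jk}(x, \boldsymbol\theta)| \le K(x) \quad\text{with } \mathrm{E}K(x) < \infty, \tag{10.17}$$

then $\mathbf{B} = \mathrm{E}\dot{\boldsymbol\Psi}(x, \boldsymbol\theta_F)$, where $\dot{\boldsymbol\Psi}$ is the matrix with elements $\dot\Psi_{jk}$.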

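The intuitive idea behind the result is like that of (2.89)-(2.90): we take a
first-order Taylor expansion of $\boldsymbol\Psi$ around $\boldsymbol\theta_F$ and drop the higher-order terms.
Before dealing with the proof of (10.15), let us see how it applies to simultaneous
M-estimators of location-scale. Conditions for existence and uniqueness of solutions
are given in Huber and Ronchetti (2009, Sec. 6.4) and Maronna and Yohai (1981).
They may hold without requiring monotonicity of $\psi$. This holds in particular
for the Student MLE. As can be expected, under suitable conditions they tend in
probability to the solution $(\mu_0, \sigma_0)$ of the system of equations (2.74)-(2.75). The
joint distribution of $\sqrt{n}\,(\hat\mu - \mu_0, \hat\sigma - \sigma_0)$ tends to the bivariate normal with mean $\mathbf{0}$
and covariance matrix

$$\mathbf{V} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1\prime}, \tag{10.18}$$

where

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \qquad \mathbf{B} = \frac{1}{\sigma_0}\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix},$$

with

$$a_{11} = \mathrm{E}\psi(r)^2, \quad a_{12} = a_{21} = \mathrm{E}\left(\rho_{\mathrm{scale}}(r) - \delta\right)\psi(r), \quad a_{22} = \mathrm{E}\left(\rho_{\mathrm{scale}}(r) - \delta\right)^2,$$

where

$$r = \frac{x - \mu_0}{\sigma_0},$$

and

$$b_{11} = \mathrm{E}\psi'(r), \quad b_{12} = \mathrm{E}\,r\psi'(r), \quad b_{21} = \mathrm{E}\rho'_{\mathrm{scale}}(r), \quad b_{22} = \mathrm{E}\,r\rho'_{\mathrm{scale}}(r).$$

If $\psi$ is odd, $\rho_{\mathrm{scale}}$ is even, and $F$ is symmetric, the reader can verify (Problem 10.5)
that $\mathbf{V}$ is diagonal,

$$\mathbf{V} = \begin{bmatrix} v_{11} & 0 \\ 0 & v_{22} \end{bmatrix},$$

so that $\hat\mu$ and $\hat\sigma$ are asymptotically independent, and their variances take on a simple
form:

$$v_{jj} = \sigma_0^2\,\frac{a_{jj}}{b_{jj}^2} \quad (j = 1, 2);$$

that is, the asymptotic variance of each estimator is calculated as if the other parameter
were constant.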
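
The sandwich form (10.18) is easy to evaluate once A and B are known. The minimal Python sketch below uses hypothetical helper names and arbitrary illustrative numbers, not values from the text. It shows that when A and B are both diagonal the result reduces to a_jj/b_jj² on the diagonal, which is the "as if the other parameter were constant" statement above (taking σ_0 = 1).

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose2(m):
    """Transpose of a 2x2 matrix."""
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]


def sandwich(A, B):
    """V = B^{-1} A (B^{-1})', the covariance form in (10.15) and (10.18)."""
    B_inv = inv2(B)
    return matmul2(matmul2(B_inv, A), transpose2(B_inv))


A = [[0.7, 0.0], [0.0, 0.3]]  # illustrative values only
B = [[0.8, 0.0], [0.0, 0.5]]  # illustrative values only
V = sandwich(A, B)
print(V)  # diagonal, with entries a_jj / b_jj**2
```

With a non-diagonal B the off-diagonal entries of V are in general nonzero, and each variance then depends on the whole of A and B.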

We shall now prove (10.15) under much more restricted assumptions. We shall 
need an auxiliary result. 


Theorem 10.10 (Multivariate Slutsky lemma) Let $\mathbf{u}_n$ and $\mathbf{v}_n$ be two sequences of
random vectors and $\mathbf{W}_n$ a sequence of random matrices such that for some constant
vector $\mathbf{u}$, random vector $\mathbf{v}$ and random matrix $\mathbf{W}$

$$\mathbf{u}_n \to_p \mathbf{u}, \quad \mathbf{v}_n \to_d \mathbf{v}, \quad \mathbf{W}_n \to_p \mathbf{W}.$$

Then

$$\mathbf{u}_n + \mathbf{v}_n \to_d \mathbf{u} + \mathbf{v} \quad\text{and}\quad \mathbf{W}_n\mathbf{v}_n \to_d \mathbf{W}\mathbf{v}.$$

The proof proceeds as in the univariate case.
Now we proceed with the proof of asymptotic normality under more restricted
assumptions. Let $\hat{\boldsymbol\theta}_n$ be any solution of (10.11).


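Theorem 10.11 Assume that $\hat{\boldsymbol\theta}_n \to_p \boldsymbol\theta_F$, where $\boldsymbol\theta_F$ is the unique solution of
$\boldsymbol\lambda_F(\boldsymbol\theta) = \mathbf{0}$. Let $\boldsymbol\Psi$ be twice differentiable with respect to $\boldsymbol\theta$ with bounded derivatives,
and satisfying also (10.17). Then (10.15) holds.


Proof. The proof follows that of Theorem 10.7. For each $j$, call $\ddot{\boldsymbol\Psi}_j$ the matrix with
elements $\partial^2\Psi_j/\partial\theta_k\partial\theta_l$, and $\mathbf{C}_n(x, \boldsymbol\theta)$ the matrix with $j$th row equal to $(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F)'\,\ddot{\boldsymbol\Psi}_j(x, \boldsymbol\theta)$.
By a Taylor expansion

$$\mathbf{0} = \hat{\boldsymbol\lambda}_n(\hat{\boldsymbol\theta}_n) = \frac{1}{n}\sum_{i=1}^{n}\left\{\boldsymbol\Psi(x_i, \boldsymbol\theta_F) + \dot{\boldsymbol\Psi}(x_i, \boldsymbol\theta_F)(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F) + \frac{1}{2}\mathbf{C}_n(x_i, \boldsymbol\theta_i)(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F)\right\}.$$

In other words

$$\mathbf{0} = \mathbf{A}_n + (\mathbf{B}_n + \overline{\mathbf{C}}_n)(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F),$$

with

$$\mathbf{A}_n = \frac{1}{n}\sum_{i=1}^{n}\boldsymbol\Psi(x_i, \boldsymbol\theta_F), \quad \mathbf{B}_n = \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol\Psi}(x_i, \boldsymbol\theta_F), \quad \overline{\mathbf{C}}_n = \frac{1}{2n}\sum_{i=1}^{n}\mathbf{C}_n(x_i, \boldsymbol\theta_i);$$

that is, $\overline{\mathbf{C}}_n$ is the matrix with the $j$th row equal to $(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F)'\,\overline{\ddot{\boldsymbol\Psi}}_j$, where

$$\overline{\ddot{\boldsymbol\Psi}}_j = \frac{1}{2n}\sum_{i=1}^{n}\ddot{\boldsymbol\Psi}_j(x_i, \boldsymbol\theta_i),$$

which is bounded. Since $\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F \to_p \mathbf{0}$, this implies that also $\overline{\mathbf{C}}_n \to_p \mathbf{0}$. We have

$$\sqrt{n}\,(\hat{\boldsymbol\theta}_n - \boldsymbol\theta_F) = -(\mathbf{B}_n + \overline{\mathbf{C}}_n)^{-1}\sqrt{n}\,\mathbf{A}_n.$$

Note that for $i = 1, 2, \ldots$, the vectors $\boldsymbol\Psi(x_i, \boldsymbol\theta_F)$ are i.i.d. with mean $\mathbf{0}$ (since
$\boldsymbol\lambda(\boldsymbol\theta_F) = \mathbf{0}$) and covariance matrix $\mathbf{A}$, and the matrices $\dot{\boldsymbol\Psi}(x_i, \boldsymbol\theta_F)$ are i.i.d. with
mean $\mathbf{B}$. Hence when $n \to \infty$, the law of large numbers implies $\mathbf{B}_n \to_p \mathbf{B}$, which
implies $\mathbf{B}_n + \overline{\mathbf{C}}_n \to_p \mathbf{B}$, which is nonsingular. The central limit theorem implies
$\sqrt{n}\,\mathbf{A}_n \to_d N_p(\mathbf{0}, \mathbf{A})$, and hence (10.15) follows by the multivariate version of
Slutsky's lemma. □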


10.6 Location M-estimators with preliminary scale 


We shall consider the asymptotic behavior of solutions of (2.67). For each $n$ let $\hat\sigma_n$ be
a dispersion estimator, and call $\hat\mu_n$ the solution (assumed unique) of

$$\sum_{i=1}^{n}\psi\left(\frac{x_i - \mu}{\hat\sigma_n}\right) = 0. \tag{10.19}$$

For consistency, it will be assumed that

A1 $\psi$ is monotone and bounded with a bounded derivative;
A2 $\sigma = \operatorname{plim}\hat\sigma_n$ exists;
A3 the equation $\mathrm{E}\psi((x - \mu)/\sigma) = 0$ has a unique solution $\mu_0$.

Theorem 10.12 If A1-A2-A3 hold, then $\hat\mu_n \to_p \mu_0$.

The proof follows the lines of Theorem 10.5, but the details require much more
care, and are hence omitted.
Define now $u_i = x_i - \mu_0$ and

$$a = \mathrm{E}\psi\left(\frac{u}{\sigma}\right)^2, \quad b = \mathrm{E}\psi'\left(\frac{u}{\sigma}\right), \quad c = \mathrm{E}\left(\frac{u}{\sigma}\right)\psi'\left(\frac{u}{\sigma}\right). \tag{10.20}$$

For asymptotic normality, assume

A4 the quantities defined in (10.20) exist and $b \ne 0$;
A5 $\sqrt{n}\,(\hat\sigma_n - \sigma)$ converges to some distribution;
A6 $c = 0$.

Theorem 10.13 Under A4-A5-A6, we have

$$\sqrt{n}\,(\hat\mu_n - \mu_0) \to_d N(0, v) \quad\text{with } v = \sigma^2\frac{a}{b^2}. \tag{10.21}$$

Note that if $x$ has a symmetric distribution, then $\mu_0$ coincides with its center of
symmetry, and hence the distribution of $u$ is symmetric about zero, which implies
(since $\psi$ is odd) that $c = 0$.

Adding the assumption that y has a bounded second derivative, the theorem 
may be proved along the lines of Theorem 10.5, but the details are somewhat more 
involved. We shall content ourselves with a heuristic proof of (10.21) to exhibit the 
main ideas. 

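Put for brevity

$$\Delta_{1n} = \hat\mu_n - \mu_0, \qquad \Delta_{2n} = \hat\sigma_n - \sigma.$$

Then expanding $\psi$ as in (2.89) yields

$$\psi\left(\frac{x_i - \hat\mu_n}{\hat\sigma_n}\right) = \psi\left(\frac{u_i - \Delta_{1n}}{\sigma + \Delta_{2n}}\right) \approx \psi\left(\frac{u_i}{\sigma}\right) - \psi'\left(\frac{u_i}{\sigma}\right)\frac{\Delta_{1n} + \Delta_{2n}u_i/\sigma}{\sigma}.$$

Inserting the right-hand side of this expression into (2.67) and dividing by $n$ yields

$$0 = A_n - \frac{1}{\sigma}\left(\Delta_{1n}B_n + \Delta_{2n}C_n\right),$$

where

$$A_n = \frac{1}{n}\sum_{i=1}^{n}\psi\left(\frac{u_i}{\sigma}\right), \quad B_n = \frac{1}{n}\sum_{i=1}^{n}\psi'\left(\frac{u_i}{\sigma}\right), \quad C_n = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{u_i}{\sigma}\right)\psi'\left(\frac{u_i}{\sigma}\right),$$

and hence

$$\sqrt{n}\,\Delta_{1n} = \frac{\sigma\sqrt{n}\,A_n - \Delta_{2n}\sqrt{n}\,C_n}{B_n}. \tag{10.22}$$

Now $A_n$ is the average of i.i.d. variables with mean 0 (by (10.19)) and variance
$a$, and hence the central limit theorem implies that $\sqrt{n}\,A_n \to_d N(0, a)$; the law of large
numbers implies that $B_n \to_p b$. If $c = 0$, then $\sqrt{n}\,C_n$ tends to a normal by the central
limit theorem, and since $\Delta_{2n} \to_p 0$ by hypothesis, Slutsky's lemma yields (10.21).

If $c \ne 0$, the term $\Delta_{2n}\sqrt{n}\,C_n$ does not tend to zero, and the asymptotic variance of
$\Delta_{1n}$ will depend on that of $\hat\sigma_n$ and also on the correlation between $\hat\sigma_n$ and $\hat\mu_n$.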

10.7 Trimmed means


Although the numerical computing of trimmed means, and in general of
L-estimators, is very simple, their asymptotic theory is much more complicated
than that of M-estimators; even heuristic derivations are involved.

It is shown (see Huber and Ronchetti 2009, Sec. 3.3) that under suitable regularity
conditions, the $\alpha$-trimmed mean $\hat\theta_n$ converges in probability to

$$\theta_F = \frac{1}{1 - 2\alpha}\,\mathrm{E}_F\,x\,\mathrm{I}(k_1 \le x \le k_2), \tag{10.23}$$

where

$$k_1 = F^{-1}(\alpha), \qquad k_2 = F^{-1}(1 - \alpha). \tag{10.24}$$

Let $F(x) = F_0(x - \mu)$, with $F_0$ symmetric about zero. Then $\theta_F = \mu$
(Problem 10.4).
If $F$ is as above, then $\sqrt{n}\,(\hat\theta_n - \mu) \to_d N(0, v)$ with

$$v = \frac{1}{(1 - 2\alpha)^2}\,\mathrm{E}_F\,\psi_k(x - \mu)^2, \tag{10.25}$$

where $\psi_k$ is Huber's function with $k = F_0^{-1}(1 - \alpha)$, so that the asymptotic variance
coincides with that of an M-estimator.
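
For the standard normal distribution, (10.25) can be evaluated in closed form, because E ψ_k(x)² = (2Φ(k) − 1) − 2kφ(k) + 2k²(1 − Φ(k)). The Python sketch below does this; the closed form and the function names are supplied here for illustration and are not part of the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def huber_psi_second_moment(k):
    """E psi_k(x)^2 for standard normal x and Huber's psi with constant k."""
    Phi = STD_NORMAL.cdf(k)
    phi = STD_NORMAL.pdf(k)
    return (2.0 * Phi - 1.0) - 2.0 * k * phi + 2.0 * k * k * (1.0 - Phi)


def trimmed_mean_avar(alpha):
    """Asymptotic variance (10.25) of the alpha-trimmed mean at the standard normal."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie strictly between 0 and 0.5")
    k = STD_NORMAL.inv_cdf(1.0 - alpha)
    return huber_psi_second_moment(k) / (1.0 - 2.0 * alpha) ** 2


for alpha in (0.01, 0.10, 0.25):
    v = trimmed_mean_avar(alpha)
    print(alpha, round(v, 4), "efficiency vs. mean:", round(1.0 / v, 4))
```

Since the sample mean has asymptotic variance 1 in this model, 1/v is the asymptotic efficiency of the trimmed mean; it decreases as the trimming proportion grows.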


10.8 Optimality of the MLE 


It can be shown that the MLE is "optimal", in the sense of minimizing the asymptotic
variance, in a general class of asymptotically normal estimators (Shao, 2003). Here
its optimality will be shown within the class of M-estimators of the form (10.1).
The MLE is an M-estimator, which under the conditions of Theorem 10.5 is
Fisher-consistent; that is, it verifies (3.32). In fact, assume $\dot f_\theta = \partial f_\theta/\partial\theta$ is bounded.
Then differentiating

$$\int_{-\infty}^{\infty} f_\theta(x)\,dx = 1$$

with respect to $\theta$ yields

$$0 = \int_{-\infty}^{\infty}\dot f_\theta(x)\,dx = -\int_{-\infty}^{\infty}\Psi_0(x, \theta)f_\theta(x)\,dx \quad \forall\,\theta, \tag{10.26}$$

so that (10.6) holds (the interchange of integral and derivative is justified by the
bounded convergence theorem).
Under the conditions of Theorem 10.7, the MLE has asymptotic variance

$$v_0 = \frac{A_0}{B_0^2}$$

with

$$A_0 = \int_{-\infty}^{\infty}\Psi_0^2(x, \theta)f_\theta(x)\,dx, \qquad B_0 = \int_{-\infty}^{\infty}\dot\Psi_0(x, \theta)f_\theta(x)\,dx,$$

with $\dot\Psi_0 = \partial\Psi_0/\partial\theta$. The quantity $A_0$ is called the Fisher information.


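Now consider another M-estimator of the form (10.1), which is Fisher-consistent
for $\theta$; that is, such that

$$\int_{-\infty}^{\infty}\Psi(x, \theta)f_\theta(x)\,dx = 0 \quad \forall\,\theta, \tag{10.27}$$

and has asymptotic variance

$$v = \frac{A}{B^2}$$

with

$$A = \int_{-\infty}^{\infty}\Psi^2(x, \theta)f_\theta(x)\,dx, \qquad B = \int_{-\infty}^{\infty}\dot\Psi(x, \theta)f_\theta(x)\,dx.$$

It will be shown that

$$v_0 \le v. \tag{10.28}$$

We shall show first that $B_0 = A_0$, which implies

$$v_0 = \frac{1}{A_0}. \tag{10.29}$$

In fact, differentiating the last member of (10.26) with respect to $\theta$ yields

$$0 = B_0 + \int_{-\infty}^{\infty}\Psi_0(x, \theta)\frac{\dot f_\theta(x)}{f_\theta(x)}f_\theta(x)\,dx = B_0 - A_0.$$

By (10.29), (10.28) is equivalent to

$$B^2 \le A_0 A. \tag{10.30}$$

Differentiating (10.27) with respect to $\theta$ yields

$$B - \int_{-\infty}^{\infty}\Psi(x, \theta)\Psi_0(x, \theta)f_\theta(x)\,dx = 0.$$

The Cauchy-Schwarz inequality yields

$$\left(\int_{-\infty}^{\infty}\Psi(x, \theta)\Psi_0(x, \theta)f_\theta(x)\,dx\right)^2 \le \left(\int_{-\infty}^{\infty}\Psi(x, \theta)^2f_\theta(x)\,dx\right)\left(\int_{-\infty}^{\infty}\Psi_0(x, \theta)^2f_\theta(x)\,dx\right),$$

which proves (10.30).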
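
Inequality (10.28) can be illustrated on the normal location model with unit variance, where the MLE is the sample mean, A_0 = 1 and hence v_0 = 1. The Python sketch below computes v = A/B² for the Huber location M-estimator, using the closed forms A = E ψ_k(u)² and |B| = 2Φ(k) − 1 that hold for standard normal u. The choice of estimator, the closed forms and the function name are assumptions made for this illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def huber_avar(k):
    """Asymptotic variance A / B^2 of the Huber location M-estimator at N(0, 1)."""
    Phi = STD_NORMAL.cdf(k)
    phi = STD_NORMAL.pdf(k)
    A = (2.0 * Phi - 1.0) - 2.0 * k * phi + 2.0 * k * k * (1.0 - Phi)  # E psi_k(u)^2
    B = 2.0 * Phi - 1.0  # E psi_k'(u); only B^2 enters, so the sign is immaterial
    return A / (B * B)


V0 = 1.0  # variance bound 1 / A_0 for the N(mu, 1) location model
for k in (0.5, 1.0, 1.345, 2.0, 4.0):
    v = huber_avar(k)
    print(k, round(v, 4), v >= V0)
```

Every value of v is at least v_0 = 1 and approaches it as k grows, since ψ_k then approaches the identity, which is the MLE score for this model.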


10.9 Regression M-estimators: existence and uniqueness


From now on it will be assumed that $\mathbf{X}$ has full rank, and that $\hat\sigma$ is fixed or estimated
previously (i.e., it does not depend on $\hat{\boldsymbol\beta}$). We first establish the existence of solutions
of (4.39).


Theorem 10.14 Let $\rho(r)$ be a continuous nondecreasing unbounded function of $|r|$.
Then there exists a solution of (4.39).


Proof. Here $\hat\sigma$ plays no role, so that we may put $\hat\sigma = 1$. Since $\rho$ is bounded from
below, so is the function

$$R(\boldsymbol\beta) = \sum_{i=1}^{n}\rho(y_i - \mathbf{x}_i'\boldsymbol\beta). \tag{10.31}$$

Call $L$ its infimum; that is, the largest of its lower bounds. We must show the existence
of $\boldsymbol\beta_0$ such that $R(\boldsymbol\beta_0) = L$. It will first be shown that $R(\boldsymbol\beta)$ is bounded away from $L$ if
$\|\boldsymbol\beta\|$ is large enough. Let

$$a = \min_{\|\boldsymbol\beta\| = 1}\ \max_{i = 1, \ldots, n}|\mathbf{x}_i'\boldsymbol\beta|.$$

Then $a > 0$, since otherwise there would exist $\boldsymbol\beta \ne \mathbf{0}$ such that $\mathbf{x}_i'\boldsymbol\beta = 0$ for all $i$,
which contradicts the full rank property. Let $b_0 > 0$ be such that $\rho(b_0) > 2L$, and $b$
such that $ba - \max_i|y_i| \ge b_0$. Then $\|\boldsymbol\beta\| > b$ implies $\max_i|y_i - \mathbf{x}_i'\boldsymbol\beta| \ge b_0$, and hence
$R(\boldsymbol\beta) > 2L$. Thus minimizing $R$ for $\boldsymbol\beta \in R^p$ is equivalent to minimizing it on the closed
ball $\{\|\boldsymbol\beta\| \le b\}$. A well-known result of analysis states that a function that is continuous
on a closed bounded set attains its minimum in it. Since $R$ is continuous, the
proof is completed. □

Now we deal with the uniqueness of monotone M-estimators. Again we may take
$\hat\sigma = 1$.


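Theorem 10.15 Assume $\psi$ nondecreasing. Put for given $(\mathbf{x}_i, y_i)$

$$\mathbf{L}(\boldsymbol\beta) = \sum_{i=1}^{n}\psi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol\beta}{\hat\sigma}\right)\mathbf{x}_i.$$

Then (a) all solutions of $\mathbf{L}(\boldsymbol\beta) = \mathbf{0}$ minimize $R(\boldsymbol\beta)$ defined in (10.31) and (b) if furthermore
$\psi$ has a positive derivative, then $\mathbf{L}(\boldsymbol\beta) = \mathbf{0}$ has a unique solution.


Proof.

(a) Without loss of generality we may assume that $\mathbf{L}(\mathbf{0}) = \mathbf{0}$. For a given $\boldsymbol\beta$ let
$H(t) = R(t\boldsymbol\beta)$ with $R$ defined in (10.31). We must show that $H(1) \ge H(0)$. Since
$dH(t)/dt = -\boldsymbol\beta'\mathbf{L}(t\boldsymbol\beta)$ and $\psi$ is odd, we have

$$H(1) - H(0) = R(\boldsymbol\beta) - R(\mathbf{0}) = \sum_{i=1}^{n}\int_0^1\psi(t\mathbf{x}_i'\boldsymbol\beta - y_i)(\mathbf{x}_i'\boldsymbol\beta)\,dt.$$

If $\mathbf{x}_i'\boldsymbol\beta > 0$ (resp. $< 0$), then for $t > 0$, $\psi(t\mathbf{x}_i'\boldsymbol\beta - y_i)$ is greater (resp. smaller) than
$\psi(-y_i)$. Hence

$$\psi(t\mathbf{x}_i'\boldsymbol\beta - y_i)(\mathbf{x}_i'\boldsymbol\beta) \ge \psi(-y_i)(\mathbf{x}_i'\boldsymbol\beta),$$

which implies that

$$R(\boldsymbol\beta) - R(\mathbf{0}) \ge \sum_{i=1}^{n}\psi(-y_i)(\mathbf{x}_i'\boldsymbol\beta) = -\boldsymbol\beta'\mathbf{L}(\mathbf{0}) = 0.$$

(b) The matrix of derivatives of $\mathbf{L}$ with respect to $\boldsymbol\beta$ is

$$-\frac{1}{\hat\sigma}\sum_{i=1}^{n}\psi'\left(\frac{y_i - \mathbf{x}_i'\boldsymbol\beta}{\hat\sigma}\right)\mathbf{x}_i\mathbf{x}_i',$$

which is negative definite; and the proof proceeds as that of Theorem 10.9. □


The former results do not cover MM- or S-estimators, since they are not
monotonic. As was shown for location in Theorem 10.2, uniqueness holds for the
asymptotic value of the estimator under the model $y_i = \mathbf{x}_i'\boldsymbol\beta + u_i$ if the $u_i$ have a
symmetric unimodal distribution.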


10.10 Regression M-estimators: asymptotic normality 


10.10.1 Fixed X 


Now, to treat the asymptotic behavior of the estimator, we consider an infinite
sequence $(\mathbf{x}_i, y_i)$ described by model (4.4). Call $\hat{\boldsymbol\beta}_n$ the estimator (4.40) and $\hat\sigma_n$ the
scale estimator. Call $\mathbf{X}_n$ the matrix with rows $\mathbf{x}_i'$ $(i = 1, \ldots, n)$, which is assumed to
have full rank. Then

$$\mathbf{X}_n'\mathbf{X}_n = \sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'$$

is positive definite, and hence it has a "square root"; that is, a (nonunique) $p \times p$ matrix
$\mathbf{R}_n$ such that

$$\mathbf{R}_n'\mathbf{R}_n = \mathbf{X}_n'\mathbf{X}_n. \tag{10.32}$$

Call $\lambda_n$ the smallest eigenvalue of $\mathbf{X}_n'\mathbf{X}_n$, and define

$$h_{in} = \mathbf{x}_i'(\mathbf{X}_n'\mathbf{X}_n)^{-1}\mathbf{x}_i \tag{10.33}$$

and

$$M_n = \max\{h_{in} : i = 1, \ldots, n\}.$$

Define $\mathbf{R}_n$ as in (10.32). Assume

B1 $\lim_{n\to\infty}\lambda_n = \infty$;
B2 $\lim_{n\to\infty}M_n = 0$.

Then we have


Theorem 10.16 Assume conditions A1-A2-A3 of Section 10.6. If B1 holds, then
$\hat{\boldsymbol\beta}_n \to_p \boldsymbol\beta$. If also B2 and A4-A5-A6 hold, then

$$\mathbf{R}_n(\hat{\boldsymbol\beta}_n - \boldsymbol\beta) \to_d N_p(\mathbf{0}, v\mathbf{I}), \tag{10.34}$$

with $v$ given by (10.21).


The proof in a more general setting can be found in Yohai and Maronna (1979).
For large $n$, the left-hand side of (10.34) has an approximate $N_p(\mathbf{0}, v\mathbf{I})$ distribution,
and from this, (4.43) follows since $\mathbf{R}_n^{-1}\mathbf{R}_n^{-1\prime} = (\mathbf{X}_n'\mathbf{X}_n)^{-1}$.

When $p = 1$ (fitting a straight line through the origin), condition B1 means that
$\sum_{i=1}^{n}x_i^2 \to \infty$, which prevents the $x_i$ from clustering around the origin. Condition B2
becomes

$$\lim_{n\to\infty}\frac{\max\{x_i^2 : i = 1, \ldots, n\}}{\sum_{j=1}^{n}x_j^2} = 0,$$

which means that none of the $x_i^2$ dominate the sum in the denominator; that is, there
are no leverage points.
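
For p = 1 the quantity in (10.33) is h_in = x_i²/Σ_j x_j², so condition B2 can be checked directly on a design. The short Python sketch below contrasts two hypothetical designs chosen only for illustration: equally spaced points, for which the maximum leverage tends to zero, and geometrically growing points, for which the last point always keeps about three quarters of the total.

```python
def max_leverage(xs):
    """M_n = max_i x_i^2 / sum_j x_j^2 for a regression line through the origin."""
    total = sum(x * x for x in xs)
    if total == 0:
        raise ValueError("all predictors are zero")
    return max(x * x for x in xs) / total


for n in (10, 100, 1000):
    equally_spaced = [float(i) for i in range(1, n + 1)]  # x_i = i
    print(n, round(max_leverage(equally_spaced), 5))

for n in (10, 20, 40):
    geometric = [2.0**i for i in range(1, n + 1)]         # x_i = 2^i
    print(n, round(max_leverage(geometric), 5))
```

The second design violates B2 because the largest x_i² is comparable to the whole sum for every n, which is exactly the leverage-point situation excluded by the condition.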
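Now we consider a model with an intercept, namely (4.4). Let

$$\bar{\mathbf{x}}_n = \operatorname{ave}_i(\mathbf{x}_i) \quad\text{and}\quad \mathbf{C}_n = \sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}}_n)(\mathbf{x}_i - \bar{\mathbf{x}}_n)'.$$

Let $\mathbf{T}_n$ be any square root of $\mathbf{C}_n$, that is,

$$\mathbf{T}_n'\mathbf{T}_n = \mathbf{C}_n.$$


Theorem 10.17 Assume the same conditions as in Theorem 10.16 excepting A3 and
A6. Then

$$\mathbf{T}_n(\hat{\boldsymbol\beta}_{1n} - \boldsymbol\beta_1) \to_d N_{p-1}(\mathbf{0}, v\mathbf{I}). \tag{10.35}$$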


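We shall give a heuristic proof for the case of a straight line; that is,

$$y_i = \beta_0 + \beta_1x_i + u_i \tag{10.36}$$

so that $\mathbf{x}_i = (1, x_i)$. Put

$$\bar x_n = \frac{1}{n}\sum_{i=1}^{n}x_i, \qquad C_n = \sum_{i=1}^{n}(x_i - \bar x_n)^2.$$

Then condition B1 is equivalent to

$$C_n \to \infty \quad\text{and}\quad \frac{\bar x_n^2}{C_n} \to 0. \tag{10.37}$$

The first condition prevents the $x_i$ from clustering around a point.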
An auxiliary result will be needed. Let $v_i$, $i = 1, 2, \ldots$, be i.i.d. variables with a
finite variance, and for each $n$ let $a_{1n}, \ldots, a_{nn}$ be a set of constants. Let

$$V_n = \sum_{i=1}^{n}a_{in}v_i, \qquad \gamma_n = \mathrm{E}V_n, \qquad \tau_n^2 = \operatorname{Var}(V_n).$$

Then $W_n = (V_n - \gamma_n)/\tau_n$ has zero mean and unit variance, and the central limit
theorem asserts that if for each $n$ we have $a_{1n} = \ldots = a_{nn}$, then $W_n \to_d N(0, 1)$. It
can be shown that this is still valid if the $a_{in}$ are such that no term in $V_n$ "dominates
the sum", in the following sense:


Lemma 10.18 If the $a_{in}$ are such that

$$\lim_{n\to\infty}\frac{\max\{a_{in}^2 : i = 1, \ldots, n\}}{\sum_{j=1}^{n}a_{jn}^2} = 0, \tag{10.38}$$

then $W_n \to_d N(0, 1)$.


This result is a consequence of the so-called Lindeberg theorem (Feller, 1971). To
see the need for condition (10.38), consider the $v_i$ having a non-normal distribution $G$
with unit variance. Take for each $n$: $a_{1n} = 1$ and $a_{in} = 0$ for $i > 1$. Then $V_n/\tau_n = v_1$,
which has distribution $G$ for all $n$, and hence does not tend to the normal.
To demonstrate Theorem 10.17 for the model (10.36), let

$$T_n = \sqrt{C_n}, \qquad z_{in} = \frac{x_i - \bar x_n}{T_n},$$

so that

$$\sum_{i=1}^{n}z_{in} = 0, \qquad \sum_{i=1}^{n}z_{in}^2 = 1. \tag{10.39}$$

Then (10.33) becomes

$$h_{in} = \frac{1}{n} + z_{in}^2,$$

so that condition B2 is equivalent to

$$\max\{|z_{in}| : i = 1, \ldots, n\} \to 0. \tag{10.40}$$

We have to show that for large $n$, $\hat\beta_{1n}$ is approximately normal, with variance
$v/C_n$; that is, that $T_n(\hat\beta_{1n} - \beta_1) \to_d N(0, v)$.
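The estimating equations are

$$\sum_{i=1}^{n}\psi\left(\frac{r_i}{\hat\sigma_n}\right) = 0, \tag{10.41}$$

$$\sum_{i=1}^{n}\psi\left(\frac{r_i}{\hat\sigma_n}\right)x_i = 0, \tag{10.42}$$

with $r_i = y_i - (\hat\beta_{0n} + \hat\beta_{1n}x_i)$. Combining both equations yields

$$\sum_{i=1}^{n}\psi\left(\frac{r_i}{\hat\sigma_n}\right)x_{in}^* = 0, \tag{10.43}$$

where $x_{in}^* = x_i - \bar x_n$. Put

$$\Delta_{1n} = \hat\beta_{1n} - \beta_1, \qquad \Delta_{0n} = \hat\beta_{0n} - \beta_0, \qquad \Delta_{2n} = \hat\sigma_n - \sigma.$$

The Taylor expansion of $\psi$ at $t$ is

$$\psi(t + \varepsilon) = \psi(t) + \varepsilon\psi'(t) + o(\varepsilon), \tag{10.44}$$

where the last term is higher-order infinitesimal. Writing

$$r_i = u_i - \left(\Delta_{0n} + \Delta_{1n}(x_{in}^* + \bar x_n)\right),$$

expanding $\psi$ at $u_i/\sigma$ and dropping the last term in (10.44) yields

$$\psi\left(\frac{r_i}{\hat\sigma_n}\right) = \psi\left(\frac{u_i - \left(\Delta_{0n} + \Delta_{1n}(x_{in}^* + \bar x_n)\right)}{\sigma + \Delta_{2n}}\right) \approx \psi\left(\frac{u_i}{\sigma}\right) - \psi'\left(\frac{u_i}{\sigma}\right)\frac{\Delta_{0n} + \Delta_{1n}(x_{in}^* + \bar x_n) + \Delta_{2n}u_i/\sigma}{\sigma}. \tag{10.45}$$

Inserting (10.45) in (10.43), multiplying by $\sigma$ and dividing by $T_n$, and recalling
(10.39), yields

$$\sigma A_n = (T_n\Delta_{1n})\left(B_n + \frac{\bar x_n}{T_n}C_n\right) + \Delta_{0n}C_n + \Delta_{2n}D_n, \tag{10.46}$$

where, with $z_{in} = x_{in}^*/T_n$ and with $C_n$ from here on denoting the sum below,

$$A_n = \sum_{i=1}^{n}\psi\left(\frac{u_i}{\sigma}\right)z_{in}, \qquad B_n = \sum_{i=1}^{n}\psi'\left(\frac{u_i}{\sigma}\right)z_{in}^2,$$

$$C_n = \sum_{i=1}^{n}\psi'\left(\frac{u_i}{\sigma}\right)z_{in}, \qquad D_n = \sum_{i=1}^{n}\psi'\left(\frac{u_i}{\sigma}\right)\frac{u_i}{\sigma}z_{in}.$$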
Put

$$a = \mathrm{E}\psi\left(\frac{u}{\sigma}\right)^2, \qquad b = \mathrm{E}\psi'\left(\frac{u}{\sigma}\right), \qquad e = \operatorname{Var}\left(\psi'\left(\frac{u}{\sigma}\right)\right).$$

Applying Lemma 10.18 to $A_n$ (with $v_i = \psi(u_i/\sigma)$), and recalling (10.39) and (10.40),
yields $A_n \to_d N(0, a)$. The same procedure shows that $C_n$ and $D_n$ have normal limit
distributions. Applying (10.39) to $B_n$ yields

$$\mathrm{E}B_n = b, \qquad \operatorname{Var}(B_n) = e\sum_{i=1}^{n}z_{in}^4.$$

Now by (10.39) and (10.40)

$$\sum_{i=1}^{n}z_{in}^4 \le \max_{1 \le i \le n}z_{in}^2 \to 0,$$

and then Tchebychev's inequality implies that $B_n \to_p b$. Recall that $\Delta_{2n} \to_p 0$
by hypothesis, and $\Delta_{0n} \to_p 0$ by Theorem 10.16. Since $\bar x_n/T_n \to 0$ by (10.37),
application of Slutsky's lemma to (10.46) yields the desired result.

The asymptotic variance of $\hat\beta_{0n}$ may be derived by inserting (10.45) in (10.41).
The situation is similar to that of Section 10.6: if $c = 0$ (with $c$ defined in (10.20)) the
proof can proceed, otherwise the asymptotic variance of $\hat\beta_{0n}$ depends on that of $\hat\sigma_n$.


10.10.2 Asymptotic normality: random X


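Since the observations $\mathbf{z}_i = (\mathbf{x}_i, y_i)$ are i.i.d., this situation can be treated with the
methods of Section 10.5. A regression M-estimator is a solution of

$$\sum_{i=1}^{n}\boldsymbol\Psi(\mathbf{z}_i, \boldsymbol\beta) = \mathbf{0}$$

with

$$\boldsymbol\Psi(\mathbf{z}, \boldsymbol\beta) = \mathbf{x}\,\psi(y - \mathbf{x}'\boldsymbol\beta).$$

We shall prove (5.14) for the case of $\sigma$ known and equal to one. It follows from
(10.15) that the asymptotic covariance matrix of $\hat{\boldsymbol\beta}$ is $v\mathbf{V}_{\mathbf{x}}^{-1}$, with $v$ given by (5.15)
with $\sigma = 1$, and $\mathbf{V}_{\mathbf{x}} = \mathrm{E}\mathbf{x}\mathbf{x}'$. In fact, the matrices $\mathbf{A}$ and $\mathbf{B}$ in (10.16) and (10.14) are

$$\mathbf{A} = \mathrm{E}\psi(y - \mathbf{x}'\boldsymbol\beta)^2\mathbf{x}\mathbf{x}', \qquad \mathbf{B} = -\mathrm{E}\psi'(y - \mathbf{x}'\boldsymbol\beta)\mathbf{x}\mathbf{x}',$$

and their existence is ensured by assuming that $\psi$ and $\psi'$ are bounded, and that
$\mathrm{E}\|\mathbf{x}\|^2 < \infty$. Under the model (5.1)-(5.2) we have

$$\mathbf{A} = \mathrm{E}\psi(u)^2\,\mathbf{V}_{\mathbf{x}}, \qquad \mathbf{B} = -\mathrm{E}\psi'(u)\,\mathbf{V}_{\mathbf{x}},$$

and the result follows immediately.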


10.11 Regression M estimators: Fisher-consistency 


In this section we prove the Fisher-consistency (Section 3.5.3) of regression M
estimators with random predictors. The case of monotone and redescending $\psi$ will
be considered separately. A general treatment of the consistency of regression M
estimators is given in Fasano et al. (2012).


10.11.1 Redescending estimators 
Let the random elements $\mathbf{x} \in R^p$ and $y \in R$ satisfy the linear model

$$y = \mathbf{x}'\boldsymbol\beta_0 + u \tag{10.47}$$

with $u$ independent of $\mathbf{x}$. Let $\hat{\boldsymbol\beta} = \hat{\boldsymbol\beta}(\mathbf{x}, y)$ be an M-estimating functional defined as
$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}h(\boldsymbol\beta)$ with

$$h(\boldsymbol\beta) = \mathrm{E}\rho(y - \mathbf{x}'\boldsymbol\beta), \tag{10.48}$$

where $\rho$ is a given $\rho$-function (Definition 2.1). It will be proved that, under certain
conditions on $\rho$ and the distributions of $\mathbf{x}$ and $u$, $\hat{\boldsymbol\beta}$ is Fisher-consistent; that is, $\boldsymbol\beta_0$ is
the unique minimizer of $h$. It will be assumed that:

• $\rho$ is a differentiable bounded $\rho$-function. It may be assumed that $\sup_t\rho(t) = 1$.

• The distribution $F$ of $u$ has an even density $f$ such that $f(t)$ is nonincreasing in
$|t|$ and is decreasing in $|t|$ in a neighborhood of 0.

• The distribution $G$ of $\mathbf{x}$ satisfies $P(\boldsymbol\beta'\mathbf{x} = 0) < 1$ for all $\boldsymbol\beta \ne \mathbf{0}$.

Theorem 10.19 $h(\boldsymbol\beta) > h(\boldsymbol\beta_0)$ for all $\boldsymbol\beta \ne \boldsymbol\beta_0$.

The proof requires some auxiliary results.

Lemma 10.20 Put $g(\lambda) = \mathrm{E}\rho(u - \lambda)$. Then:

(i) $g$ is even
(ii) $g(\lambda)$ is nondecreasing for $\lambda \ge 0$
(iii) $g(0) < g(\lambda)$ for $\lambda \ne 0$.


Proof of (i). Note that $\rho(u - \lambda)$ has the same distribution as $\rho(-u - \lambda) = \rho(u + \lambda)$,
and therefore $g(\lambda) = g(-\lambda)$. □


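Proof of (ii). For $\lambda \ge 0$ and $t \ge 0$ call $R_\lambda(t)$ the distribution function of $v = |u - \lambda|$:

$$R_\lambda(t) = P(|u - \lambda| \le t) = \int_{\lambda - t}^{\lambda + t}f(s)\,ds.$$

Put $\dot R_\lambda(t) = \partial R_\lambda(t)/\partial\lambda$. Then

$$\dot R_\lambda(t) = f(\lambda + t) - f(\lambda - t) \le 0. \tag{10.49}$$

Let $\lambda_0$ be such that for $t \in [0, \lambda_0]$, $f(t)$ is decreasing and $\psi(t) = \rho'(t) > 0$. Then

$$\dot R_\lambda(t) < 0 \quad\text{if } 0 < \lambda \le \lambda_0 \text{ and } 0 < t \le \lambda_0. \tag{10.50}$$

It follows from

$$R_{\lambda_2}(t) - R_{\lambda_1}(t) = \int_{\lambda_1}^{\lambda_2}\dot R_\lambda(t)\,d\lambda$$

that $\lambda_2 \ge \lambda_1 \ge 0$ implies $R_{\lambda_1}(t) \ge R_{\lambda_2}(t)$, and therefore by (10.50)

$$R_0(t) > R_\lambda(t) \quad\text{if } 0 < \lambda \le \lambda_0 \text{ and } 0 < t \le \lambda_0. \tag{10.51}$$

Integrating by parts yields

$$g(\lambda) = \mathrm{E}\rho(u - \lambda) = \int_0^\infty\rho(t)\frac{\partial R_\lambda(t)}{\partial t}\,dt = R_\lambda(t)\rho(t)\Big|_0^\infty - \int_0^\infty\rho'(t)R_\lambda(t)\,dt = 1 - \int_0^\infty\psi(t)R_\lambda(t)\,dt,$$

and if $0 \le \lambda_1 \le \lambda_2$, (10.49) implies

$$g(\lambda_2) - g(\lambda_1) = \int_0^\infty\psi(t)\left(R_{\lambda_1}(t) - R_{\lambda_2}(t)\right)dt \ge 0, \tag{10.52}$$

which proves (ii). □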

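Proof of (iii). For $0 < \lambda \le \lambda_0$, we have

$$g(\lambda) - g(0) = \int_0^\lambda\psi(t)\left(R_0(t) - R_\lambda(t)\right)dt + \int_\lambda^\infty\psi(t)\left(R_0(t) - R_\lambda(t)\right)dt,$$

and since for $0 < t \le \lambda$ (10.51) implies $R_0(t) - R_\lambda(t) > 0$ and also $\psi(t) > 0$, it follows
that

$$\int_0^\lambda\psi(t)\left(R_0(t) - R_\lambda(t)\right)dt > 0 \quad\text{and}\quad \int_\lambda^\infty\psi(t)\left(R_0(t) - R_\lambda(t)\right)dt \ge 0,$$

and therefore

$$g(\lambda) > g(0) \quad\text{if } 0 < \lambda \le \lambda_0. \tag{10.53}$$

It follows from (10.52)-(10.53) that $g(\lambda) > g(0)$ if $0 < \lambda < \infty$. The result for
$\lambda < 0$ follows from part (i) of the lemma. □


Proof of the theorem. Let $\boldsymbol\beta \ne \boldsymbol\beta_0$ and put $\boldsymbol\gamma = \boldsymbol\beta - \boldsymbol\beta_0 \ne \mathbf{0}$. Then

$$h(\boldsymbol\beta) = \mathrm{E}\rho(y - \boldsymbol\beta'\mathbf{x}) = \mathrm{E}\rho(u - \boldsymbol\gamma'\mathbf{x}) = \mathrm{E}\left(\mathrm{E}[\rho(u - \boldsymbol\gamma'\mathbf{x})\,|\,\mathbf{x}]\right).$$

The independence of $\mathbf{x}$ and $u$ implies that $\mathrm{E}[\rho(u - \boldsymbol\gamma'\mathbf{x})\,|\,\mathbf{x}] = g(\boldsymbol\gamma'\mathbf{x})$, and therefore
$h(\boldsymbol\beta) = \mathrm{E}(g(\boldsymbol\gamma'\mathbf{x}))$. Besides, since $h(\boldsymbol\beta_0) = g(0)$, we have

$$h(\boldsymbol\beta) - h(\boldsymbol\beta_0) = \mathrm{E}(g(\boldsymbol\gamma'\mathbf{x})) - g(0).$$

Part (iii) of the lemma implies that $g(\boldsymbol\gamma'\mathbf{x}) - g(0) \ge 0$, and since $P(\boldsymbol\gamma'\mathbf{x} \ne 0) > 0$, it
follows that if we put $W = g(\boldsymbol\gamma'\mathbf{x}) - g(0)$, then $W \ge 0$ and $P(W > 0) > 0$. Therefore

$$h(\boldsymbol\beta) - h(\boldsymbol\beta_0) = \mathrm{E}(W) > 0,$$

which completes the proof. □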
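
The behavior of g(λ) = E ρ(u − λ) described in Lemma 10.20 can be seen numerically. The Python sketch below is an illustration under stated assumptions: it takes the bisquare ρ scaled so that sup ρ = 1 with an arbitrary constant k = 4.68, standard normal errors u (an even density that decreases in |t|), and a plain trapezoidal rule for the expectation. The function names are hypothetical.

```python
import math


def rho_bisquare(t, k=4.68):
    """Bounded rho-function with sup rho = 1 (bisquare family; k is illustrative)."""
    if abs(t) >= k:
        return 1.0
    return 1.0 - (1.0 - (t / k) ** 2) ** 3


def normal_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def g(lam, half_width=12.0, steps=24000):
    """Trapezoidal approximation of E rho(u - lam) for standard normal u."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        u = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * rho_bisquare(u - lam) * normal_pdf(u)
    return total * h


for lam in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(lam, round(g(lam), 5))
```

The printed values are smallest at λ = 0, equal at ±1, and increase with |λ|, in line with parts (i) to (iii) of the lemma.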


10.11.2 Monotone estimators 


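Let the random elements $\mathbf{x} \in R^p$ and $y \in R$ satisfy the linear model (10.47) with $u$
independent of $\mathbf{x}$. Let $\hat{\boldsymbol\beta} = \hat{\boldsymbol\beta}(\mathbf{x}, y)$ be an M-estimating functional defined as the
solution $\boldsymbol\beta$ of $\mathbf{h}(\boldsymbol\beta) = \mathbf{0}$ with

$$\mathbf{h}(\boldsymbol\beta) = \mathrm{E}\psi(y - \mathbf{x}'\boldsymbol\beta)\mathbf{x} \tag{10.54}$$

where $\psi$ is a given monotone $\psi$-function (Definition 2.2). It will be proved that under
certain conditions on $\psi$ and the distributions of $\mathbf{x}$ and $u$, $\hat{\boldsymbol\beta}$ is Fisher-consistent; that
is, $\boldsymbol\beta = \boldsymbol\beta_0$ is the only solution of $\mathbf{h}(\boldsymbol\beta) = \mathbf{0}$. It will be assumed that:

(A) $\psi$ is odd, continuous and nondecreasing, and there exists $a$ such that $\psi(t)$ is
increasing on $[0, a]$.

(B) The distribution $F$ of $u$ has an even density $f$ that is positive on $[-b, b]$ for some
$b$.

(C) The distribution $G$ of $\mathbf{x}$ satisfies $P(\boldsymbol\beta'\mathbf{x} = 0) < 1$ for all $\boldsymbol\beta \ne \mathbf{0}$.


Theorem 10.21 $\mathbf{h}(\boldsymbol\beta) \ne \mathbf{0}$ for all $\boldsymbol\beta \ne \boldsymbol\beta_0$.

The proof requires some auxiliary results.

Lemma 10.22 Let $g(\lambda) = \mathrm{E}\psi(u - \lambda)$. Then:

(i) $g$ is odd
(ii) $g$ is nonincreasing
(iii) $g(\lambda) < 0$ for $\lambda > 0$.


Proof of (i). Since $u$ and $-u$ have the same distribution, and $\psi$ is odd, then
$\psi(u - \lambda)$ has the same distribution as $\psi(-u - \lambda) = -\psi(u + \lambda)$, and hence
$\mathrm{E}(\psi(u - \lambda)) = -\mathrm{E}(\psi(u + \lambda))$, which implies (i). □


Proof of (ii). Let $\lambda_1 < \lambda_2$. Then

$$g(\lambda_1) - g(\lambda_2) = \mathrm{E}\left(\psi(u - \lambda_1) - \psi(u - \lambda_2)\right). \tag{10.55}$$

Since $\psi$ is nondecreasing and $u - \lambda_1 > u - \lambda_2$ we have $\psi(u - \lambda_1) - \psi(u - \lambda_2) \ge 0$,
which implies $\mathrm{E}(\psi(u - \lambda_1) - \psi(u - \lambda_2)) \ge 0$, and hence $g(\lambda_1) - g(\lambda_2) \ge 0$. □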


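Proof of (iii). Note that by (A) and (B) there exists $c$ such that for $|t| \le c$, $\psi(t)$ is
increasing and $f$ is positive. Let $\lambda \in [0, c]$. Applying (10.55) to $\lambda_1 = 0$ and $\lambda_2 = \lambda$,
and recalling that $g(0) = 0$, yields $g(\lambda) = -\mathrm{E}(\psi(u) - \psi(u - \lambda))$ and hence

$$g(\lambda) = \mathrm{E}\left(\psi(u - \lambda) - \psi(u)\right) = \int_{-\infty}^{\infty}\left(\psi(t - \lambda) - \psi(t)\right)f(t)\,dt. \tag{10.56}$$

Put $R(t, \lambda) = (\psi(t - \lambda) - \psi(t))f(t)$. Then

$$R(t, \lambda) \le 0 \quad\text{for } t \in R \text{ and } \lambda \ge 0. \tag{10.57}$$

If $\lambda > 0$ and $0 \le t \le \lambda \le c$ it follows that $|t - \lambda| \le c$, which implies

$$R(t, \lambda) < 0 \quad\text{for } \lambda > 0 \text{ and } 0 \le t \le \lambda \le c. \tag{10.58}$$

Using (10.56) decompose $g$ as

$$g(\lambda) = \int_{-\infty}^{0}R(t, \lambda)\,dt + \int_0^\lambda R(t, \lambda)\,dt + \int_\lambda^\infty R(t, \lambda)\,dt = I_1 + I_2 + I_3.$$

Then (10.57) implies that $I_j \le 0$ for $1 \le j \le 3$, and (10.58) implies $I_2 < 0$ and
hence $g(\lambda) < 0$ for $\lambda \in (0, c]$. At the same time, (ii) implies that if $\lambda > c$ then $g(\lambda) \le
g(c) < 0$, which proves (iii). □


Lemma 10.23 Let $q(\lambda) = g(\lambda)\lambda$. Then $q(0) = 0$, $q(\lambda)$ is even and $q(\lambda) < 0$
for $\lambda \ne 0$.


The proof is an immediate consequence of the previous lemma.


Proof of the theorem. We have

$$\mathbf{h}(\boldsymbol\beta_0) = \mathrm{E}\psi(y - \boldsymbol\beta_0'\mathbf{x})\mathbf{x} = \mathrm{E}\psi(u)\mathbf{x} = (\mathrm{E}\psi(u))(\mathrm{E}\mathbf{x}),$$

and since $\mathrm{E}\psi(u) = 0$ we have $\mathbf{h}(\boldsymbol\beta_0) = \mathbf{0}$.
Let $\boldsymbol\beta \ne \boldsymbol\beta_0$; then $\boldsymbol\gamma = \boldsymbol\beta - \boldsymbol\beta_0 \ne \mathbf{0}$. It will be shown that assuming $\mathbf{h}(\boldsymbol\beta) = \mathbf{0}$ yields
a contradiction. For in that case

$$\boldsymbol\gamma'\mathbf{h}(\boldsymbol\beta) = 0, \tag{10.59}$$

while on the other hand

$$\boldsymbol\gamma'\mathbf{h}(\boldsymbol\beta) = \mathrm{E}\psi(y - \boldsymbol\beta'\mathbf{x})\boldsymbol\gamma'\mathbf{x} = \mathrm{E}\psi(u - \boldsymbol\gamma'\mathbf{x})\boldsymbol\gamma'\mathbf{x} = \mathrm{E}\left(\mathrm{E}[\psi(u - \boldsymbol\gamma'\mathbf{x})\boldsymbol\gamma'\mathbf{x}\,|\,\mathbf{x}]\right) = \mathrm{E}q(\boldsymbol\gamma'\mathbf{x}).$$

But since $P(\boldsymbol\gamma'\mathbf{x} \ne 0) > 0$, Lemma 10.23 implies that $\mathrm{E}q(\boldsymbol\gamma'\mathbf{x}) < 0$, and therefore
$\boldsymbol\gamma'\mathbf{h}(\boldsymbol\beta) < 0$, which contradicts (10.59). □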


10.12 Nonexistence of moments of the sample median 


We shall show that there are extremely heavy-tailed distributions for which the sample 
median has no finite moments of any order. 

Let the sample $\{x_1, \ldots, x_n\}$ have a continuous distribution function $F$ and an odd
sample size $n = 2m + 1$. Then its median $\hat\theta_n$ has distribution function $G$ such that

$$P(\hat\theta_n > t) = 1 - G(t) = \sum_{j=0}^{m}\binom{n}{j}F(t)^j(1 - F(t))^{n-j}. \tag{10.60}$$

In fact, let $N = \#(x_i \le t)$, which is binomial $\mathrm{Bi}(n, F(t))$. Then $\hat\theta_n > t$ iff $N \le m$, which
yields (10.60).

It is easy to show, using integration by parts, that if $T$ is a nonnegative variable
with distribution function $G$, then

$$\mathrm{E}(T^k) = k\int_0^\infty t^{k-1}(1 - G(t))\,dt. \tag{10.61}$$

Now let

$$F(x) = \left(1 - \frac{1}{\log x}\right)\mathrm{I}(x \ge e).$$

Since for all positive $r$ and $s$

$$\lim_{t\to\infty}\frac{t^r}{(\log t)^s} = \infty,$$

it follows from (10.61) that $\mathrm{E}\hat\theta_n^k = \infty$ for all positive $k$.

10.13 Problems


10.1. Let $x_1, \ldots, x_n$ be i.i.d. with continuous distribution function $F$. Show that the
distribution function of the order statistic $x_{(m)}$ is

$$G(t) = \sum_{k=m}^{n}\binom{n}{k}F(t)^k(1 - F(t))^{n-k}$$

(Hint: for each $t$, the variable $N_t = \#\{x_i \le t\}$ is binomial and verifies
$x_{(m)} \le t \iff N_t \ge m$).

10.2. Let $F$ be such that $F(a) = F(b) = 0.5$ for some $a < b$. If $x_1, \ldots, x_{2n-1}$ are i.i.d.
with distribution $F$, show that the distribution of $x_{(n)}$ tends to the average of
the point masses at $a$ and $b$.

10.3. Let $F_n = (1 - n^{-1})N(0, 1) + n^{-1}\delta_{n^2}$, where $\delta_x$ is the point-mass at $x$. Verify that
$F_n \to N(0, 1)$, but its mean and variance tend to infinity.

10.4. Verify that if $x$ is symmetric about $\mu$, then (10.23) is equal to $\mu$.

10.5. Verify that if $\psi$ is odd, $\rho$ is even, and $F$ symmetric, then $\mathbf{V}$ in (10.18) is diagonal;
and compute the asymptotic variances of $\hat\mu$ and $\hat\sigma$.

10.6. Show that scale M-estimators are uniquely defined if $\rho(x)$ is strictly increasing
for all $x$ such that $\rho(x) < \rho(\infty)$. (To make things easier, assume $\rho$ differentiable.)

10.7. Show that the location estimator with Huber's $\psi_k$ and previous dispersion $\hat\sigma$ is
uniquely defined unless there exists a solution $\hat\mu$ of $\sum_{i=1}^{n}\psi_k((x_i - \hat\mu)/\hat\sigma) = 0$
such that $|x_i - \hat\mu| > k\hat\sigma$ for all $i$.


Description of Datasets


Here we describe the datasets used in the book.


Alcohol 


The solubility of alcohols in water is important in understanding alcohol transport
in living organisms. This dataset from Romanelli et al. (2001) contains physicochemical
characteristics of 44 aliphatic alcohols. The aim of the experiment was the
prediction of the solubility on the basis of molecular descriptors. The columns are:


1. SAG: solvent accessible surface-bounded molecular volume
2. V: volume
3. Log PC: (octanol-water partitions coefficient)
4. P: polarizability
5. RM: molar refractivity
6. Mass
7. ln(solubility) (response)


Algae 


This dataset is part of a larger one (http://kdd.ics.uci.edu/databases/coil/coil.html),
which comes from a water quality study where samples were taken from sites on different
European rivers over a period of approximately one year. These samples were
analyzed for various chemical substances. In parallel, algae samples were collected
to determine the algal population distributions. The columns are:


1. Season (1,2,3,4 for winter, spring, summer and autumn)
2. River size (1,2,3 for small, medium and large)
3. Fluid velocity (1,2,3 for low, medium and high)
4-11. Content of nitrogen in the form of nitrates, nitrites and ammonia, and other
chemical compounds


The response is the abundance of a type of algae (type 6 in the complete file). For
simplicity we deleted the rows with missing values, or with null response values, and
took the logarithm of the response.


Aptitude 
There are three variables observed on 27 subjects:


• Score: numeric, represents scores on an aptitude test for a course
• Exp: numeric, represents months of relevant previous experience
• Pass: binary response, 1 if the subject passed the exam at the end of the course and
0 otherwise.


The data may be downloaded as dataset 6.2 from the site
http://www.jeremymiles.co.uk/regressionbook/data/.


Bus 


This dataset from the Turing Institute, Glasgow, Scotland, contains measures of shape 
features extracted from vehicle silhouettes. The images were acquired by a camera 
looking downwards at the model vehicle from a fixed angle of elevation. 

The following features were extracted from the silhouettes. 


1. Compactness
2. Circularity
3. Distance circularity
4. Radius ratio
5. Principal axis aspect ratio
6. Maximum length aspect ratio
7. Scatter ratio
8. Elongatedness
9. Principal axis rectangularity
10. Maximum length rectangularity
11. Scaled variance along major axis
12. Scaled variance along minor axis
13. Scaled radius of gyration
14. Skewness about major axis
15. Skewness about minor axis
16. Kurtosis about minor axis
17. Kurtosis about major axis
18. Hollows ratio.


Glass 


This is part of a file donated by Vina Speihler, describing the composition of glass
pieces from cars. The columns are:


1. RI refractive index
2. Na2O sodium oxide (unit measurement: weight percent in corresponding oxide,
as are the rest of attributes)
3. MgO magnesium oxide
4. Al2O3 aluminum oxide
5. SiO2 silicon oxide
6. K2O potassium oxide
7. CaO calcium oxide.


Hearing 


Prevalence rates in percent for men aged 55-64 with hearing levels 16 decibels or 
more above the audiometric zero. The rows correspond to different frequencies and 
to normal speech. 


1. 500 Hz
2. 1000 Hz
3. 2000 Hz
4. 3000 Hz
5. 4000 Hz
6. 6000 Hz
7. Normal speech.


The columns classify the data into seven occupational groups: 


1. Professional-managerial
2. Farm
3. Clerical sales
4. Craftsmen
5. Operatives
6. Service
7. Laborers.


Image 


The data were supplied by A. Frery. They are a part of a synthetic aperture satellite 
radar image corresponding to a suburb of Munich. 


Krafft 


The Krafft point is an important physical characteristic of the compounds called
surfactants, establishing the minimum temperature at which a surfactant can be used.
The purpose of the experiment was to estimate the Krafft point of compounds as a
function of their molecular structure.

The columns are: 


1. Randić index
2. Volume of tail of molecule
3. Dipole moment of molecule
4. Heat of formation
5. Krafft point (response).


Neuralgia 


The data come from a study on the effect of iontophoretic treatment on elderly 
patients complaining of post-herpetic neuralgia. There were eighteen patients in the 
study, who were interviewed six weeks after the initial treatment and were asked if 
the pain had been reduced. 

There are 18 observations on 5 variables: 


• Pain: binary response: 1 if the pain eased, 0 otherwise.
• Treatment: binary variable: 1 if the patient underwent treatment, 0 otherwise.
• Age: the age of the patient in completed years.
• Gender: M (male) or F (female).
• Duration: pretreatment duration of symptoms (in months).


Oats


Yield of grain in grams per 16-foot row for each of eight varieties of oats in five 
replications in a randomized-block experiment. 


Solid waste 


The original data are the result of a study on production waste and land use by
Golueke and McGauhey (1970), and contain nine variables. Here we consider the
following six.


1. Industrial land (acres)
2. Fabricated metals (acres)
3. Trucking and wholesale trade (acres)
4. Retail trade (acres)
5. Restaurants and hotels (acres)
6. Solid waste (millions of tons), response.


Stack loss 


The columns are: 


1. Air flow
2. Cooling water inlet temperature (°C)
3. Acid concentration (%)
4. Stack loss, defined as the percentage of ingoing ammonia that escapes unabsorbed
   (response).


Toxicity 


The aim of the experiment was to predict the toxicity of carboxylic acids on the basis 
of several molecular descriptors. The attributes for each acid are: 


1. log(IGC50): aquatic toxicity (response)
2. log Kow: partition coefficient
3. pKa: dissociation constant
4. ELUMO: energy of the lowest unoccupied molecular orbital
5. Ecarb: electrotopological state of the carboxylic group
6. Emet: electrotopological state of the methyl group
7. RM: molar refractivity
8. IR: refraction index
9. Ts: surface tension
10. P: polarizability.


Wine 


This dataset, which is part of a larger one donated by Riccardo Leardi, gives the 
composition of several wines. The attributes are: 


1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline.